ABSTRACT

Measuring cognitive achievement gains in project
evaluation is dealt with. This guide provides those concerned with 
project evaluation with the basic tools for conducting technically 
sound, interpretable evaluation studies. Exotic designs are avoided 
and five basic models which appear feasible to implement in 
real-world settings are focused on. Each is briefly summarized in 
terms of general characteristics, strengths and weaknesses, and 
considerations relating to its implementation. Twelve hazards which 
are commonly encountered in educational evaluation and which may 
completely invalidate otherwise sound studies are described. A 
procedural guide, in decision-tree form, for selecting a suitable 
evaluation model given the particular set of constraints faced by the 
project director and the evaluator is presented. The sections on 
implementation are intended for use by evaluation specialists and are 
somewhat technical. It is assumed that the reader will have had at
least one college-level course in elementary statistics. Ways of
analyzing, interpreting, and reporting results are suggested, and 
details of recommended procedures are included. Appendices expand 
upon issues raised in the guidebook and characteristics of some 
widely used commercial reading and mathematics achievement tests are 
listed. (RC) 
FOREWORD



A series of monographs is being developed by the U.S. Office of
Education to discuss methodological issues in the area of
educational program evaluation. Integrating accepted statistical,
research, and measurement practices, the monographs will serve as
informative guidebooks to educational evaluators and administrators.
As such, they will respond to the mandate by Congress on the
Commissioner of Education to develop and disseminate acceptable
models for the evaluation of educational projects (Education
Amendments of 1974, Public Law 93-380).

This monograph, A Practical Guide to Measuring Project Impact on
Student Achievement, is an outgrowth of work by RMC Research Corporation
of Los Altos, California, on Contract OEC-0-73-6662 entitled, "Planning
Study for the Development of Project Information Packages for Effective
Approaches to Compensatory Education." It discusses five evaluation
models and the methodological and procedural implications of each.
I. INTRODUCTION



Purpose and Scope 

The evaluation of any special instructional project is affected by 
decisions made at all levels of project administration and at all stages 
of planning and implementation. All too often an evaluation specialist 
is brought in after a project is well under way only to find that actions
have already been taken which make it difficult, if not impossible, to 
perform any kind of meaningful impact assessment. 

To avoid this clearly undesirable situation, directors of educational 
projects need to be aware of the consequences their decisions may have 
for evaluation and to appreciate the need for working closely with their 
evaluators from the earliest planning stage. This guidebook attempts
to address the needs of project directors as well as evaluators, and the 
next section of this chapter specifically designates certain sections 
as "recommended reading" for project directors. 

The guidebook deals with only one central aspect of project evalua- 
tion, measuring cognitive achievement gains. It is not concerned with 
project costs or with any affective benefits which project participants 
may accrue. Neither does it address any such "process" variables as
how well the objectives of the project were stated, how well the needs
of the children were assessed, or how closely teachers followed prescribed 
instructional strategies. The entire focus is on obtaining as clear and 
unambiguous an answer as possible to the question, "How much more did pupils 
learn by participating in the project than they would have learned with- 
out it?" 

The guidebook is the result of a search by the authors for effective 
compensatory reading and mathematics projects (Tallmadge, 1974). The 
search encompassed some 2,000 projects, all of which had received some
form of "official" recognition for success. Of the 2,000, only six 
could be found which, under close scrutiny, were able to meet the selec-
tion criteria of effectiveness, cost, availability, and replicability
established for this search (see Foat, 1974). Most discouraging, however,
was the fact that not one of the evaluations provided acceptable evidence
regarding project success or failure. In all cases, problems in conducting
and reporting the evaluations rendered the results inconclusive. Obviously, 
practical considerations prevent school evaluators from doing controlled, 
laboratory experiments, but many of the problems in current evaluation 
practices could be avoided with little or no increase in cost or effort. 
The rigor of laboratory experimentation may be beyond reach, but the state- 
of-the-art can be greatly improved without placing unrealistic demands 
on schools or evaluation resources. 

The purpose of this guidebook is to provide those concerned with
project evaluation with the basic tools they need to conduct technically 
sound, interpretable evaluation studies. Every effort has been made to 
minimize the amount of technical sophistication required of users of the 
guidebook. It deliberately avoids exotic designs and focuses instead on 
five basic models which appear feasible to implement in real-world settings. 
Despite this orientation, it must be acknowledged that evaluation is not,
and cannot be made, simple. Particularly where situational constraints
force adoption of statistical rather than experimental controls for 
extraneous influences, theoretical and computational complexities multiply 
at an astonishing rate. 

It seems likely that some potential users of the guidebook will find 
certain sections overly technical. On the other hand, those readers who 
can follow the more difficult portions may find much that seems trivial
or unnecessary. Perhaps the best that can be hoped is that a reasonable
compromise has been found between the inherent complexity of the total 
evaluation problem and the need to accomplish meaningful assessments with- 
out placing unreasonable demands on the technical expertise of the evaluator. 

Organization and Content 

The guidebook covers the evaluation process from the adminis- 
trative decisions in selecting an evaluation design to the details of 
collecting, analyzing, and reporting the data. Many of the details will 
be of interest primarily to the evaluation specialist, and the project
director may skip those sections without detriment. The following para-
graphs summarize the topics and indicate the audience for whom each was
intended.

The final sections of this chapter describe Evaluation Basics and 
Preliminary Planning. These sections are quite brief and should be read
by project directors as well as those concerned with the details of pro- 
ject evaluation. 

Chapter II describes 12 hazards which are commonly encountered in 
educational evaluation and which may completely invalidate otherwise 
sound studies. Each of the hazards is named and then described very 
briefly. Material is then presented discussing why the hazard may in- 
validate impact assessment. Finally, there is a section on how the 
hazard can be avoided. 

The 12 presentations are not lengthy and should be read by both pro- 
ject directors and evaluators. As a minimum, project directors should 
read the summary statement of each hazard in order to recognize the
practices and to realize that they must be avoided if a valid evaluation 
is to be done. 

Chapter III presents a procedural guide for selecting a suitable
evaluation model given the particular set of constraints faced by the
project director and evaluator. The entire procedure is presented in
decision-tree form (see Figure 1, p. 47) with each decision point rep-
resented by a question followed by a choice of two alternatives (e.g.,
Is a comparison group evaluation design feasible?). Each question is
discussed on separate pages which describe the implications of the de- 
cisions and the alternative courses of action available to the evaluator. 

It is strongly recommended that the project director as well as the
evaluator read Chapter III. Portions of some of the discussion sections 
become quite technical and may be skipped by the project director, but 
it is important that he be familiar with the evaluation options open to
him and with the consequences of the decisions he must make.

Chapter IV presents the five evaluation models referred to in
Chapter III. There is a brief summary of each model which describes
its general characteristics, its strengths, its weaknesses, and consider- 
ations relating to its implementation. The summaries should be carefully 
read by the project director. 

Each summary is followed by several pages of step-by-step instruc- 
tions for implementing the model except in one instance where the compu- 
tational procedures were judged too complex for inclusion. 

The sections on implementation are intended for use by evaluation 
specialists and are somewhat technical. It is assumed that the reader 
will have had at least one college-level course in elementary statistics 
and will be familiar with and able to compute means, standard deviations, 
and correlations. No further expertise should be required to follow the 
implementation procedures, although the underlying concepts and rationales 
may not always be understood. Consultation with a statistician is ad- 
visable for evaluators not familiar with the concepts of covariance and 
regression if models employing these statistical procedures are selected. 

The format of the guidebook is such that the design selection pro-
cedure in Chapter III (decision-tree) will automatically lead the reader
to only one of the five models described in Chapter IV. He would thus
not need to read any of the other model descriptions. Preliminary ex-
perience with these chapters, however, suggests that they are interactive,
and that reading about the alternative models, particularly the sections
dealing with the considerations relevant to implementation, will often
lead to a rethinking of the decision made in the design selection pro-
cedure. For this reason, at least a superficial reading of all of the
model descriptions is recommended before a final model selection is made. 

Chapter V deals with the details of data collection and Chapter VI 
with summarizing and reporting of impact data. These chapters need not
be of great concern to project directors, although a cursory review of what
they contain might facilitate understanding and communication with the
project evaluators.

Several appendices are also provided which expand upon issues 
raised in the body of the guidebook. These appendices, of course, are 
intended primarily for evaluation specialists and need not concern project
directors.
Evaluation Basics



To find out whether students do better in a special project than they
would have done without it, the evaluator needs two things: a good measure
of how the students performed after their project participation, and an
accurate estimate of how they would have done without the project. The
difference between these measures provides an index of the project's im-
pact. In order to get a good measure of how students performed, the
evaluator must select an appropriate test and ensure that it is adminis-
tered and scored correctly. Often, the catalog of available tests will
not include one with exactly the characteristics desired for assessing a
particular project. However, most standardized reading and math tests 
are sensitive to any significant cognitive growth and should usually 
prove adequate for assessing the impact of special treatments. Objective- 
referenced or criterion-referenced tests are also suitable assuming that 
they have been carefully constructed. Tests and testing are discussed 
further in Chapter V of this guidebook and in Appendix A. 
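
As a minimal sketch of the index just described, the Python fragment below
subtracts an estimated no-treatment mean from the observed posttest mean.
The function name `impact_estimate` and all scores are hypothetical; how the
no-treatment estimate is obtained depends on the evaluation model chosen and
is not shown here.

```python
from statistics import mean


def impact_estimate(posttest_scores, expected_no_treatment_mean):
    """Observed posttest mean minus the estimated no-treatment mean."""
    return mean(posttest_scores) - expected_no_treatment_mean


# Hypothetical standard scores for five project participants.
observed = [48.0, 52.0, 55.0, 47.0, 53.0]
# Hypothetical estimate of how the same pupils would have scored
# without the project.
expected = 49.0

print(impact_estimate(observed, expected))
```

With these made-up numbers the observed mean is 51.0, so the index is 2.0
standard-score points.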

A more difficult problem lies in estimating how students would have
done without the project. In university laboratory studies, the experiences
of randomly selected comparison groups are controlled so as to be identical
to those of the experimental group in all respects except for the variable
of interest. This approach is rarely a viable option in school projects.
A variety of substitute approaches are commonly used but all are in varying
degrees less satisfactory. The worst of these alternatives are included
in Chapter II as "hazards" and make evaluations meaningless. The best
are included in Chapter IV with recommendations on when they should be
used and explanations of their strengths and limitations.

Chapters V and VI also suggest, as mentioned above, ways of analy- 
zing, interpreting, and reporting results. Details of recommended procedures 
are included there while characteristics of some widely used commercial 
reading and mathematical achievement tests are included in Appendix A.

Preliminary Planning

Ideally, the planning of an evaluation should proceed concurrently 
with the planning of the project to be evaluated. Obviously, this is
not always possible, but it is important to be aware of the fact that
some project decisions have important implications for evaluation, and vice
versa. Project-related decisions may, in fact, preclude the possibility
of conducting any meaningful kind of evaluation.

One area where close coordination between project and evaluation 
planning is absolutely essential is that of selecting project participants.
Several possibilities exist: 

(a) all children comprising a particular group (e.g., all third
graders) may be given a special supplementary project; 

(b) participants may be randomly selected from an identifiable 
group or population; or 

(c) eligibility for participation may depend on the special 
needs of some members of a larger group (e.g., disadvantaged,
gifted).

Each of these alternative participant selection plans fits one or more
of the models presented later in this guidebook, but is incompatible with,
or places special restrictions on, others.

A second area where coordinated planning is required is the matching 
of evaluation models with test instruments. Criterion-referenced tests 
can be used with all but the norm-referenced model which requires stan-
dardized tests. The norm-referenced model not only requires that stan-
dardized tests be used but that the same level of the same test be used
for both pre- and posttesting and that the testing be accomplished at 
exactly prescribed times during the year. 

When a project director makes an "executive" decision to use a 
specific type of test or some particular method of selecting participants, 
he severely limits the number of evaluation models which can be used and 
may substantially reduce the conclusiveness of his assessment as well.
The assumption made in this guidebook is that his first concern will be 
for conclusive findings. Accordingly, he will wish to consider the 
feasibility, practicality, and limitations of the more scientifically
sound evaluation models before restricting his choices through hasty de- 
cisions about tests or participant selection procedures. In accordance 
with this orientation, the model selection procedure illustrated in Figure 
1 presents the models arranged in order of decreasing rigor from the top 
to the bottom of the figure so that the evaluation planner can see the
consequences of each of his decisions. 

Once a model is selected through the decision-tree process, the 
evaluation planner can read about its strengths and weaknesses and about 
the conditions and restrictions associated with its use. Careful study 
of the remaining four models may suggest alternatives that appear more 
desirable. At that time, he might decide to reject his first choice,
re-enter the decision tree, and select another model. 

The decision points in the model selection procedure all relate 
to the manner in which no-treatment, posttest performance expectations 
are generated. Even where the most rigorous model is selected, however, 
there are many possibilities for implementation errors which could inval- 
idate the entire evaluation. The next section of this guidebook des- 
cribes twelve of the most commonly encountered hazards, their consequences, 
and what should be done to avoid them. These common hazards should be
studied carefully before any evaluation is undertaken. 
II. COMMON HAZARDS IN EVALUATION



This section describes twelve common hazards in evaluation, the
problems they create, and the ways in which the problems can be avoided.
The occurrence of any one of the twelve may completely invalidate an other- 
wise sound evaluation. The hazards include the following: 

1. The use of grade-equivalent scores.

2. The use of gain scores.

3. The use of norm-group comparisons with inappropriate test dates.

4. The use of inappropriate levels of tests.

5. The lack of pre- and posttest scores for each project partici-
pant.

6. The use of non-comparable treatment and comparison groups.

7. The selection of project participants based on pretest scores.

8. The assembling of a matched comparison group after the project
participants are selected.

9. The careless administration or scoring of tests.

10. The assumption that an achievement gain is due to the treat-
ment when, in reality, it could be due to some other factor.

11. The use of non-comparable pretests and posttests.

12. The use of inappropriate formulas to estimate posttest scores.

Although subsequent sections of this guidebook refer to specific 
hazards, it is strongly recommended that the reader study all of the haz-
ards before going to other material.
Hazard 1



The use of grade-equivalent scores.



Grade-equivalent scores provide an insensitive, and in some instances sys-
tematically distorted, assessment of a project's impact.



Why is this a hazard?

There are three serious problems with grade-equivalent scores: 

A. The concept of a "grade-equivalent" score is misleading. For
example, a grade-equivalent score of seven attained by a fifth grader on
a math test does not mean that he knows sixth- and seventh-grade math. It
is more accurate to say that he can do fifth-grade math as well as an
average seventh grader can do fifth-grade math although even this repre- 
sentation is not strictly accurate. It is quite possible, in fact, that 
when the test was normed, no seventh graders ever took the level of the 
test intended for use in the fifth grade. In such cases, the seventh- 
grade grade-equivalent scores reported in the test manual are simply sta- 
tistical projections and tell us little about how seventh graders would 
have actually scored if they had taken the fifth-grade test. 

B. Grade-equivalent scores do not comprise an equal-interval scale. 
That is, a grade-equivalent score of two is not in any sense "half" of
a score of four. For this reason, "average" grade-equivalent scores are
not consistent with averages computed from more appropriate kinds of 
scores and are not interpretable. 

C. The normative data for many commercial tests are collected
during one short interval of the school year, often in February or March. 
In order to establish norms for fall and spring, a smooth curve is drawn 
connecting the points which represent actual data. Unfortunately, there 
is substantial evidence that learning does not proceed uniformly over
the calendar year. One factor that contributes to the irregular learning
pattern is the effect of reduced gains, or even forgetting, during the sum-
mer months. As a result, the current procedures used to generate grade-
equivalent scores tend to make them systematically too low in the fall and
too high in the spring. Fall to spring gains thus appear to make a pro-
ject look unusually effective when actually the gains were exactly the
same as would be expected under normal classroom conditions. Even where
using grade-equivalent scores does not introduce systematic biases (as
would be the case if there were a 12-month pre-to-posttest interval), the
curve-fitting procedures used to generate such scores introduce errors
large enough to invalidate any evaluation.

How can the hazard be avoided? 

There is never any technically sound reason for using grade-equivalent 
scores in evaluating projects, and they should be avoided. Standardized 
tests can still be administered. However, raw scores should be converted 
to standard scores instead of grade-equivalent scores before summary 
statistics are computed. Mean pretest and posttest standard scores should 
then be converted to their percentile equivalents. Finally, group pre-to-post-
test gains can be compared against expectations derived from the national
norms, but only if the tests were administered at appropriate times (see 
Hazard 3). Further discussions of the problems created by using grade- 
equivalent scores for evaluation purposes can be found in Tallmadge and 
Horst (1974, Appendix E). 
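
The sequence recommended above (raw scores to standard scores, then means,
then percentile equivalents of the means) might be sketched in Python as
follows. The conversion table, and the assumption that norm-group standard
scores are normally distributed with mean 50 and standard deviation 10, are
invented for illustration; in practice both conversions come from the test
publisher's tables.

```python
from statistics import NormalDist, mean

# Hypothetical raw-to-standard-score table; a real one comes from the
# test publisher's manual.
RAW_TO_STANDARD = {20: 40.0, 25: 45.0, 30: 50.0, 35: 55.0, 40: 60.0}

# Assumption for illustration only: norm-group standard scores are
# normally distributed with mean 50 and standard deviation 10.
NORM_GROUP = NormalDist(mu=50.0, sigma=10.0)


def mean_standard_score(raw_scores):
    """Convert each raw score to a standard score, then average."""
    return mean(RAW_TO_STANDARD[r] for r in raw_scores)


def percentile_of(standard_score):
    """Percentile equivalent of a (mean) standard score."""
    return 100.0 * NORM_GROUP.cdf(standard_score)


pre = mean_standard_score([20, 25, 25, 30])   # hypothetical pretest raw scores
post = mean_standard_score([25, 30, 30, 35])  # hypothetical posttest raw scores
print(pre, round(percentile_of(pre), 1))
print(post, round(percentile_of(post), 1))
```

Note that averaging happens on the standard-score scale; only the two group
means are converted to percentiles.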
Hazard 2



The use of gain scores. 



There are several kinds of gain scores, and they are generally used in an
attempt to adjust for initial differences between treatment and comparison
groups in conventional experimental designs. Where such differences are
real, they cannot be adequately adjusted. Where they are the result of
random sampling fluctuations, "raw" gain scores overcorrect for between-
group differences, and "residual" gain scores are likely to undercorrect.



Why is this a hazard?

The most commonly encountered type of gain score is the "raw" gain 
score which is simply the posttest score minus the pretest score. (The
term "raw" does not refer to the type of pretest or posttest scores used
to determine the gain, whether raw, standard, percentile, etc., but rather
to the gain itself. The category of raw gain scores thus includes grade-
equivalent gains.)

If differences between treatment and comparison groups are random
(i.e., the two groups may be regarded as random samples from a single
population), then raw gain scores overcorrect for pretest differences by
excessively inflating the posttest performance measure of the initially 
inferior group. Analysis of covariance provides an appropriate means of 
adjusting for random pre-treatment differences between groups. 

In certain theoretical situations where differences between treat-
ment and comparison groups are real (i.e., the groups are samples from
different populations), gain scores may represent the best method for
equating the groups. In real-world educational evaluations, however,
factors such as differential growth rates, different test score relia-
bilities as a function of achievement level, different reliabilities on
pre and post measures, and test floor effects all work to invalidate 
this type of adjustment (Campbell, 1974). The authors reject statistical 
techniques for equating truly non-comparable groups in conventional experi- 
mental designs. Comparison of such groups permits defensible conclusions of 
effectiveness only in those rare instances when an initially inferior treat- 
ment group outperforms the initially superior comparison group on the 
posttest.

A "residual" gain score is not a gain score at all. It is the
difference between an actual posttest score and an estimated posttest
score where the estimate has been derived from the pretest scores using
regression techniques. Whenever there is a pretest difference between
treatment and comparison group means, residual gain scores systematically
undercorrect. The amount of undercorrection is directly proportional to
the size of the between-group difference.

Tallmadge and Horst (1974, pp. 36-37) present a further discussion
of gain score problems.

How can the hazard be avoided?

Gain scores should never be used. Where pretest scores are equal 
for treatment and comparison groups, there is, of course, no need for the 
kind of adjustment gain scores are supposed to provide. Where between- 
group differences result from random sampling fluctuations, covariance 
analysis is the appropriate technique to use. Where the differences are 
real, and the groups are truly non-comparable, there is no adequate tech- 
nique for equating them, and conventional comparison-group evaluation models 
should not be used. Appropriate alternative models are recommended in
Chapter III. 
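
The difference between a raw-gain comparison and a covariance adjustment
can be illustrated with a small Python sketch. A raw-gain comparison
implicitly adjusts the posttest difference by the full pretest difference
(a slope of 1), whereas covariance analysis adjusts by the pooled
within-group regression slope of posttest on pretest. The data and function
names below are hypothetical, and the sketch computes only the adjusted mean
difference, not a significance test.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Pooled within-group slope of posttest on pretest.

    groups: list of (pre_scores, post_scores) pairs, one pair per group.
    """
    sxy = sxx = 0.0
    for pre, post in groups:
        mx, my = mean(pre), mean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
    return sxy / sxx


def raw_gain_difference(treat, comp):
    """Treatment mean gain minus comparison mean gain (slope of 1)."""
    (tpre, tpost), (cpre, cpost) = treat, comp
    return (mean(tpost) - mean(tpre)) - (mean(cpost) - mean(cpre))


def covariance_adjusted_difference(treat, comp):
    """Posttest mean difference adjusted by the pooled within-group slope."""
    (tpre, tpost), (cpre, cpost) = treat, comp
    b = pooled_within_slope([treat, comp])
    return (mean(tpost) - mean(cpost)) - b * (mean(tpre) - mean(cpre))


# Hypothetical (pretest, posttest) scores; the treatment group starts lower.
treat = ([40, 44, 48, 52], [46, 48, 50, 52])
comp = ([44, 48, 52, 56], [48, 50, 52, 54])

print(pooled_within_slope([treat, comp]))
print(raw_gain_difference(treat, comp))
print(covariance_adjusted_difference(treat, comp))
```

With these made-up scores the within-group slope is 0.5, so the raw-gain
comparison credits the initially lower treatment group with a 2-point
advantage that disappears under the covariance adjustment.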
Hazard 3



The use of norm-group comparisons with inappropriate test dates. 



Administration of tests on dates which do not correspond to the dates 
when the actual normative data were collected invalidates norm-referenced 
comparisons. 



Why is this a hazard?

When comparison groups are available, few evaluators would even
consider testing treatment and comparison students more than a few days 
apart. When a norm group is used for comparison, however, this issue 
appears to be given little thought. The problem stems from two misleading 
practices followed by test publishers. First, interpolation or extrapo-
lation processes are used to "create" norms for periods when no "real"
normative data were collected. Thus, most publishers provide norms for 
fall, winter, and spring even though data were collected at only one or 
possibly two of these points. Projected norms are generally based on 
the assumption of linear cognitive growth over each month of the nine 
months of the school year with one additional month's gain over the three 
summer months. There is no evidence to support this assumption, and the 
created norms are likely to be far enough off to give a distorted impression 
of the impact of special instructional projects. Norms based on projected 
estimates should never be used for fall-to-spring evaluations.

The second practice is the suggestion, implicit in most norms tables, 
that the norms are valid over a three- or even a four-month period. For 
this to be the case, children would have to learn nothing over the entire
period, then show a large gain overnight at the end of the period, and so
on. This assumption is clearly absurd. If the norms are correct some-
where in the middle of the time period, they will be systematically too
low at the beginning and systematically too high at the end of the period.
The errors involved are quite large and can give a severely distorted 
picture of project impact. (See Tallmadge & Horst, 1974, Appendix D, 
p. 67.) 
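
The direction of this error can be checked with simple arithmetic. The
Python sketch below assumes, purely for illustration, that the norm-group
mean rises by a constant amount each month within a three-month norms
period and that the single tabled value is correct at mid-period. The
function names and numbers are hypothetical.

```python
def true_norm_mean(month, start_mean=40.0, growth_per_month=1.0):
    """Hypothetical norm-group mean, rising steadily within the period."""
    return start_mean + growth_per_month * month


def apparent_deviation(test_month, period_months=3.0):
    """Apparent standing of a group that is exactly average on its test date.

    The norms table supplies one value for the whole period, assumed
    here to equal the true norm-group mean at mid-period.
    """
    table_mean = true_norm_mean(period_months / 2.0)
    return true_norm_mean(test_month) - table_mean


for month in (0.0, 1.5, 3.0):
    print(month, apparent_deviation(month))
```

Under these assumptions a group that is exactly average on its test date
appears 1.5 points below the norm at the start of the period and 1.5 points
above it at the end.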

How can the hazard be avoided?

It is absolutely essential to test children in the treatment con-
dition within a few weeks of the dates on which the norm groups were
tested. Tests which provide normative data for only one point in the
year should not be used for norm-referenced evaluation of fall-to-spring
gains. Instead, it is better to select a test with normative data in both 
fall and spring even though the choice of tests is then limited. Basically, 
it is never advisable to extrapolate or interpolate very far from observed 
normative data. 
Hazard 4



The use of inappropriate levels of tests.



If most of the pupils tested are getting nearly all, or hardly any, of the 
test items correct, the level of the test is inappropriate for assessing 
their cognitive achievement status. Measurement under these conditions 
is both unreliable and invalid. Ideally, the pupils tested should score 
in the middle of the range of possible raw scores.



Why is this a hazard? 

The major standardized achievement tests are divided into several 
levels which cover different grades or grade bands. Each level is an 
individual test appropriate for only two or three grades. In the case 
of projects aimed at slow or fast learners, the test level nominally 
designated for their grade is likely to be too difficult (pupils will
encounter the test "floor") or too easy (pupils will encounter the test
"ceiling") and would not provide a reliable and valid measure of achieve-
ment. Ceiling and floor effects may cause similar distortions in evalu-
ations using criterion-referenced tests.

How can the hazard be avoided? 

Test levels should be selected on the basis of the achievement 
levels of the pupils, not on the basis of their grade in school. Usually, 
one level above or below that nominally recommended for a particular 
grade will be sufficient to avoid ceiling and floor effects, but no firm
recommendation can be made as difficulty levels and ranges of coverage
vary greatly from instrument to instrument. 
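
One way to screen for floor and ceiling problems is to tabulate how many
pupils score near the chance level or near the maximum. The Python sketch
below is illustrative only: the cutoffs (the expected guessing score for
the floor, 90 percent of the items for the ceiling) and the function name
are assumptions, not recommendations from this guidebook.

```python
def floor_ceiling_shares(raw_scores, n_items, n_choices=4, ceiling_fraction=0.90):
    """Return (share at or below chance, share at or above the ceiling cutoff)."""
    chance_score = n_items / n_choices        # expected score from blind guessing
    ceiling_cut = n_items * ceiling_fraction  # assumed "near maximum" cutoff
    n = len(raw_scores)
    at_floor = sum(1 for s in raw_scores if s <= chance_score) / n
    at_ceiling = sum(1 for s in raw_scores if s >= ceiling_cut) / n
    return at_floor, at_ceiling


# Hypothetical raw scores on a 40-item, four-choice test.
scores = [8, 10, 12, 20, 25, 30, 36, 38, 39, 40]
print(floor_ceiling_shares(scores, n_items=40))
```

Large shares at either extreme would suggest trying an adjacent test level.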

Using test levels other than those nominally recommended for 
particular grade levels is likely to mean that norms tables for the grades
tested are not included in the test manuals. This is unfortunate since 
it is clearly not meaningful to assess either status or growth through 
comparisons with children at a different grade level. The status of a
sixth grader should be assessed using sixth-grade norms even if he is 
tested with a fourth-grade test. If a comparison group is available,
there is no problem because growth is assessed with reference to the com- 
parison pupils — not to the norms. With norm-referenced evaluation models, 
however, there may be a problem. Fortunately, most major test publishers 
have interlocked their test levels by providing overlapping grade-level 
coverage. This practice has enabled the development of score equivalen- 
cies between adjacent test levels so that it is possible to predict quite 
accurately from a pupil's score on one test level how he would have scored
on the next higher or lower level.

From the between-level score equivalencies, it is common practice 
to develop a single score scale spanning all test levels so that raw 
scores from any level can be converted to scores on this scale. (Scores 
of this type are often called scale, standard, or expanded standard scores.) 
Scale scores can be referenced to any set of normative data. Thus, scores 
of sixth graders tested with a fourth-grade test can be converted to sixth- 
grade percentiles, so it is not necessary to use a test which is likely 
to be too easy or too difficult for the particular children being tested. 
While there are generally some measurement errors which result from im- 
perfect interlocking, typically they are smaller than those which result 
from encountering test ceilings or floors. 

Whatever level of a test is selected for use, that same test level 
should be used for both pre- and posttesting (see Hazard 10). 
Hazard 5

The lack of pre- and posttest scores for each treatment participant. 



Analyses of project impact should be based only on those participants with 
both pre- and posttest scores. Interpretation of these data, however, 
should take into consideration the characteristics of pupils who dropped 
out, entered late, or graduated from the project. 



Why is this a hazard? 

In most projects, the group that is ultimately posttested is not 
composed of exactly the same students as the pretest group due to drop- 
outs and new students during the school year. Therefore, pre- and post- 
test mean scores are not strictly comparable. In particular, it often 
seems that the dropouts from a special program are among the slowest 
students. Eliminating their low scores from the posttest may raise the 
mean posttest score considerably. On the other hand, some projects 
may return successful students to their regular classrooms, thus lowering 
the mean posttest score for the remaining group. It is not uncommon to
find evaluation reports which include posttest scores for fewer than half 
of the reported project participants; any conclusions in such reports are 
usually meaningless. 

How can the hazard be avoided?

It is not possible to prevent students from dropping out or entering 
a project after it has begun. Still, it is essential to base any con- 
clusions about the impact of the project on the data from students who 
have both pre- and posttest scores. Furthermore, the pretest score 
distribution for all dropouts must be examined to see if it differs from 
that of the non-dropouts. Also, if the number of dropouts is large, at 
least a brief investigation of the reasons for dropping out is required.
Sometimes a project is targeted at certain children, and the dropouts
may be either students who succeeded and returned to their regular 
classes leaving the unsuccessful students to be posttested; or they may 
be exactly the students for whom the program was intended, but who failed 
and left. Information of this nature is crucial, therefore, to the
evaluation of the program. 

In short, every effort must be made to obtain pre- and posttest 
scores for each project participant. Pretest-posttest comparisons must
be based on those students for whom both scores are available. Data 
from students having only pretest or only posttest scores must be care- 
fully examined to see if they differ in some systematic respect from the 
data of students having both pre- and posttest scores. A description of 
any of these differences should be included in the project evaluation. 
Analysis of pre- and posttest scores is discussed further in Chapter VI. 
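
The bookkeeping described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the pupil records,
the use of None for a missing score, and the helper name
split_by_completeness are all hypothetical and do not come from the
guidebook.

```python
from statistics import mean

# Hypothetical records: (pupil id, pretest score, posttest score).
# None marks a missing score (late entrant or dropout).
records = [
    ("A", 31, 40), ("B", 28, 37), ("C", 22, None),
    ("D", 35, 43), ("E", 19, None), ("F", None, 39),
]

def split_by_completeness(rows):
    """Separate pupils with both scores from pretest-only and posttest-only pupils."""
    both = [r for r in rows if r[1] is not None and r[2] is not None]
    pre_only = [r for r in rows if r[1] is not None and r[2] is None]
    post_only = [r for r in rows if r[1] is None and r[2] is not None]
    return both, pre_only, post_only

both, pre_only, post_only = split_by_completeness(records)

# Impact analysis uses only pupils with both scores.
mean_gain = mean(post - pre for _, pre, post in both)

# Compare the pretest standing of dropouts with that of completers.
completer_pre_mean = mean(pre for _, pre, _ in both)
dropout_pre_mean = mean(pre for _, pre, _ in pre_only)

print(f"pupils with both scores: {len(both)}, mean gain: {mean_gain:.1f}")
print(f"pretest mean, completers: {completer_pre_mean:.1f}; "
      f"dropouts: {dropout_pre_mean:.1f}")
```

In this made-up data set the dropouts have a clearly lower pretest mean
than the pupils who stayed, which is exactly the kind of systematic
difference the evaluation report should describe.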
Hazard 6



The use of non-comparable treatment and comparison groups.



In conventional experimental designs, treatment and comparison groups must 
be comparable in all relevant variables before the treatment begins. 
Groups which differ in terms of pretest scores are an obvious source of 
bias. Other, more subtle factors such as differences in age, sex, race, 
or socioeconomic status can also exert strong biasing influences and must
be avoided. In such designs, there is no way in which a non-comparable
comparison group can provide an accurate estimate of how well the
treatment group would have done without the treatment.



Why is this a hazard?

Students in a special program may do better or worse than comparison
groups simply because they were different to start with. One of the
most common cases occurs when students who volunteer are put in the
special program while the rest serve as a comparison group. Even given
equal pretest scores, it is likely that the volunteers are a more
enthusiastic group and will learn more. This type of rather subtle difference
is often overlooked. Of course, any obvious differences between treatment
and comparison groups may also affect evaluation results, so such variables
as socioeconomic status, age, sex, racial and ethnic composition, and
school size and setting should be carefully checked for comparability.

The problem is even more serious when norm-based comparisons are 
used. Volunteering or other selection procedures may result in a treat- 
ment group that is quite different from the norm-group students who got 
equal scores at pretest time. 

The net result in either case is that the comparison group provides
an inaccurate estimate of what project participants would have learned
without the project treatment. Theoretically, the estimate may be either
too high or too low.

How can the hazard be avoided?

Students should be assigned to treatment and comparison groups on
a random basis or in such a way that a nonrandom assignment is random
in effect (Lord, 1967, p. 38). Essentially, this means that the two
groups must be similar along all educationally relevant dimensions,
unless the evaluation model specifically calls for selection on the basis
of a pretest cutoff score. This hazard and the steps to avoid it are 
closely related to the previously discussed Hazard 2. 
Hazard 7



The selection of project participants based on pretest scores. 



When students are selected for project participation based on their ob- 
taining relatively high or relatively low scores on some test, use of 
those scores as pretest measures invalidates any kind of norm-referenced 
evaluation. The procedure also invalidates conventional experimental
designs unless the comparison group students are selected in exactly the 
same manner (see Hazard 8). 



Why is this a hazard?

This error has been so widely discussed and well documented that
most evaluators are aware of the problem. Unfortunately, for various 
reasons it is still encountered. The error results from testing a large 
group of students, selecting the lowest (or highest) ones for a special 
program, and then treating the selection scores as pretest scores. This 
practice results in systematic distortions of pre- to posttest gains. 

It is well known that if the low scoring students are retested on 
the same or a comparable test, they will score higher on the average, 
while an initially high scoring group will score lower. This phenomenon 
is called "regression toward the mean," or simply "statistical regression,"
and is discussed in virtually all texts on experimental design. The result
is that low scoring groups appear to learn more from a special program
than they actually do, while gains in special programs for high scoring 
students may be obscured. 
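
The effect is easy to reproduce with simulated scores. The sketch below is
not from the guidebook: it assumes each pupil has a stable "true" level
(mean 50, standard deviation 10) plus an independent error of measurement
(standard deviation 5) on each testing, and it gives no treatment at all
between the two testings. All of these numbers are arbitrary choices made
for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

N = 10_000
# Each pupil has a stable "true" level plus independent error on each testing.
true_level = [random.gauss(50, 10) for _ in range(N)]
test1 = [t + random.gauss(0, 5) for t in true_level]
test2 = [t + random.gauss(0, 5) for t in true_level]  # no treatment in between

# Select the lowest-scoring quarter on the first test, as a project might.
cutoff = sorted(test1)[N // 4]
selected = [i for i in range(N) if test1[i] < cutoff]

selection_mean = mean(test1[i] for i in selected)
retest_mean = mean(test2[i] for i in selected)
spurious_gain = retest_mean - selection_mean

print(f"selection-test mean: {selection_mean:.1f}")
print(f"retest mean:         {retest_mean:.1f}")
print(f"apparent gain with no treatment: {spurious_gain:.1f}")
```

The selected pupils score higher on the retest even though nothing was
done for them. If the selection scores were treated as pretest scores,
that difference would be credited to the project.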

Statistical regression presents no problem for the special and
general regression models presented in Chapter IV. It does, however,
invalidate any kind of norm-referenced evaluation. Evaluations employing
comparison groups may or may not be affected, depending on whether the regression
effect operates differently on the two groups. Hazard 8 treats a closely 
related situation in which the comparison group is selected on the basis 
of pretest scores. 

How can the hazard be avoided?

Corrections for the regression effect are possible in theory, but 
in practice the necessary data are not usually available. Thus, it is
safer to avoid the problem by not using the pretest to select project 
participants except for those regression models which specifically re- 
quire this approach. (See also Step 7, p. 23 of Tallmadge & Horst, 1974.) 
Hazard 8



The assembling of a matched comparison group after the project participants
are selected.



Finding "matches" for treatment participants in some other group is a 
fundamentally unsound practice. Unless they and the treatment pupils 
are equally representative of the groups from which they are drawn, sta- 
tistical regression will act differentially on the two groups and arti- 
ficially inflate the apparent gains of one group with respect to the other. 



Why is this a hazard? 

It may be very useful to have a comparison group made up of students
carefully matched to the treatment students, but unless the proper
procedures for selection are followed, comparisons between the two groups may
be completely misleading. The common practice of selecting students for
the treatment, then trying to find a non-treatment student to match each
treatment student is a serious evaluation error. If, for example, a
project is set up for the most underachieving children in a disadvantaged
school, it may be possible to construct a "matching" comparison group by
finding children with equally low pretest scores in less disadvantaged
schools. In this situation, the comparison students would be farther
below the means of their own schools than the treatment children, so their
posttest scores would show a greater regression toward the mean. This
regression artifact would thus inflate the apparent gains of the compari- 
son group with respect to the treatment group and might obscure a real 
project impact. 
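
A small worked example may make the size of this artifact concrete. It
uses the standard linear-regression prediction, expected retest score =
group mean + r x (pretest score - group mean), which assumes equal score
spread on both testings and no real change in between. The correlation of
0.8, the two school means, and the shared pretest score of 30 are all
hypothetical values chosen for illustration; none comes from the guidebook.

```python
def expected_retest(pretest, group_mean, r):
    """Expected retest score under linear regression toward the group mean,
    assuming equal score spread on both testings and no real change."""
    return group_mean + r * (pretest - group_mean)

R = 0.8              # assumed pretest-posttest correlation (hypothetical)
PRETEST = 30.0       # score shared by a treatment pupil and the "match"

treatment_school_mean = 40.0   # disadvantaged school (hypothetical)
comparison_school_mean = 55.0  # less disadvantaged school (hypothetical)

treatment_expected = expected_retest(PRETEST, treatment_school_mean, R)
comparison_expected = expected_retest(PRETEST, comparison_school_mean, R)

# Both pupils start at 30, but the comparison pupil is farther below
# the own-school mean and so regresses upward by a larger amount.
artifact = comparison_expected - treatment_expected
print(treatment_expected, comparison_expected, artifact)
```

Under these assumptions the "matched" comparison pupil is expected to
outscore the treatment pupil by about three points at posttest before any
project effect is considered.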

How can the hazard be avoided? 

The correct procedure for establishing matched comparison groups 
is to do the matching first and then assign members of each pair randomly
to the treatment or the comparison group. That is, a large group of
students, all eligible to be in the project, must be available. The first
step is to divide the group into matched pairs based on test scores, ethnic
background, sex, etc., so that the two members of each pair are as similar
as possible. Then, after the matching process is complete, some random
procedure such as flipping a coin is used to decide which member of each
pair goes into the treatment and which into the comparison group. Where
this approach is impossible, models which do not require matched groups
should be selected. (See Chapter III.)
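
The match-first, assign-second order can be sketched as follows. For
brevity this illustration matches on pretest score alone (the text also
calls for matching on background variables), and the pupil names, scores,
and function name are hypothetical.

```python
import random

def matched_pair_assignment(pupils, rng):
    """Pair pupils with adjacent pretest scores, then flip a coin within each pair.

    pupils: list of (name, pretest_score) tuples, all eligible for the project.
    Returns (treatment, comparison) lists of names.
    """
    ranked = sorted(pupils, key=lambda p: p[1])
    treatment, comparison = [], []
    # Step 1: match first. Adjacent pupils in the ranking form a pair.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        # Step 2: only after matching, let chance decide who is treated.
        rng.shuffle(pair)
        treatment.append(pair[0][0])
        comparison.append(pair[1][0])
    return treatment, comparison  # with an odd pool, the last pupil is left out

pool = [("Ann", 21), ("Ben", 34), ("Cal", 22), ("Dee", 35),
        ("Eve", 28), ("Fay", 27), ("Gus", 40), ("Hal", 41)]
treatment, comparison = matched_pair_assignment(pool, random.Random(1974))
print("treatment: ", treatment)
print("comparison:", comparison)
```

Because both members of every pair come from the same eligible pool and
chance alone separates them, regression toward the mean acts equally on
the two groups.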
Hazard 9



The careless administration or scoring of tests. 



Testing must be accomplished with scrupulous attention to detail. For
most evaluation models, the primary requirement is that treatment and 
comparison groups be tested in exactly the same way. The norm-referenced 
evaluation model further requires that procedures outlined by test pub- 
lishers be followed precisely. 



Why is this a hazard? 

Problems arise if tests are administered or scored in an inconsistent
and careless manner. If there are differences in the ways in which
the treatment students and the comparison students are tested or if there 
are differences in the procedures, conditions, and scoring at pretest and 
posttest times, then it is impossible for the resulting data to accurately 
reflect project impact. No amount of careful statistical analysis after 
the fact can overcome these problems. 

How can the hazard be avoided?

(a) Test procedures must be orderly and accurate if scores are to 
be meaningful. 

(b) The treatment students must be tested and scored in exactly 
the same way as comparison students. 

(c) The procedures, conditions, and scoring methods during post- 
testing must be exactly the same as during pretesting. 

Properly trained personnel decrease the probability of disorderly
or inaccurate testing procedures, but problems may be introduced by local
conditions and student attitudes. Students may not understand what is
expected of them, or in extreme cases, they may become unruly and make no
serious effort to answer test questions. Problems which occur due to
carelessness include failing to get the right name on each answer sheet, using
the wrong answer key or conversion tables, and making mistakes in copying
scores onto data sheets.

The second issue, comparability between the testing situations of 
the treatment and comparison groups, can and should be dealt with in a 
straightforward manner in comparison-group designs. In these cases,
identical procedures, even the use of a single tester, are possible. In the
more common situations in which norm-group comparisons are made, the 
instructions accompanying the test must be followed exactly. 

The third issue, comparability between pre- and posttesting situa- 
tions, requires the same attention to procedures as the other issues. The 
real problem is often the pressure on teachers to show achievement gains
which may lead them, intentionally or unintentionally, to be stricter in
enforcing time limits and avoiding helpful hints on the pretest than when
administering the posttest. This type of problem can be minimized by
having an independent, external evaluator administer the tests or by having
teachers within a school exchange classrooms so that each tests and scores
another teacher's students. 

Chapter V is devoted entirely to the details of obtaining accurate, 
meaningful data. 
Hazard 10

The assumption that an achievement gain is due to the treatment when, in 
reality, it could be due to some other factor. 



Other possible explanations always exist for observed gains. The plaus- 
ibility of these alternative explanations should be carefully examined 
before gains are attributed to project impact. 



Why is this a hazard?

Sometimes project participants learn substantially more than would 
have been expected, but the project, per se, is not responsible. Instead,
the gains could be a result of the Hawthorne effect (Whitehead, 1938) in 
which special project participants do well simply because they are getting 
special treatment. The nature of the treatment may not necessarily be 
important. An opposite result may follow from a John Henry effect 
(Saretsky, 1972). In this case, comparison-group students work extra hard 
to prove that they are just as good as project students.

Other likely causes of misleading gains are unrecognized "treatments"
which have nothing to do with the project. Most school systems are in a
constant state of flux with multiple changes every year. Changes in school
programs, personnel, facilities, class sizes, community characteristics: any
or all of these factors can affect student performance. Also, the true
source of achievement gains is sometimes improperly identified because
children are involved in more than one treatment. Under these conditions 
it is impossible to determine causality in an unambiguous manner. 

How can the hazard be avoided?

When a carefully implemented evaluation reveals significant cog- 
nitive achievement gains, it should not be immediately assumed that the 
gains are solely the result of the special treatment. A variety of other
factors exist which may lead to the obtained results. Each plausible
rival hypothesis should be examined and, where the evidence permits,
eliminated as a likely explanation. A discussion of the remaining factors
and the relative likelihood of each as a contributor to the gains should
be included in the evaluation. In succeeding years with a continuing
project, some of these competing explanations might be controlled and
eliminated.
Hazard 11



The use of non-comparable pretest and posttest.



It is almost always a good idea to use the same level of the same test
for both pre- and posttesting. In norm-referenced evaluations, it is
usually essential.



Why is this a hazard? 

The situation in which pretests differ from posttests is frequently
encountered in evaluation reports. Usually it occurs because there is
a district-wide change in testing policy during the evaluation period in
an attempt to find a more appropriate test for all district evaluations.
The disruption of evaluations of ongoing projects is unavoidable, and
may be completely beyond the control of the project evaluator. It may
also, however, severely limit the usefulness of the evaluation and should
be avoided if at all possible. The use of the same level of a test for
both pre- and posttesting is also strongly advised. Some tests have
interlocked levels so that scores from one test level can be converted into
another. However, these conversion tables reflect a certain degree of
measurement error as a result of curve fitting, rounding, and successive
transformations. It is clearly preferable to use just one level of the
test.

In a comparison-group design, the fact that the posttest differs
from the pretest may not be a critical problem. So long as pre- and
posttests are reasonably correlated, as will be true among the major
commercial tests, the comparison-group students make reasonably convincing
conclusions possible. However, in the more common norm-referenced designs,
there is no completely adequate way to compare pretest scores on one test
with posttest scores on a completely different test. Since each test is
normed on a different group of students, this amounts to using one
comparison group for the pretest, and a second comparison group for the
posttest.

How can the hazard be avoided? 

To insure comparability between the pre- and posttests in
norm-referenced evaluations, the only real solution is to administer the same
level of the same test on both occasions. When that option is not available,
it still may be possible, in some instances, to approximate it through
the use of conversion tables provided in the Anchor Test Study (Loret, Seder,
Bianchini & Vale, 1974). The Anchor Test Study provides tables which
may be used to convert scores on one test to their equivalents on each
of the other tests in the study. Conversion errors are reported to be
low, so in theory the procedure is sound; but, in any case, it applies
only to the eight most commonly used reading tests covered by the study,
and only to grades 4, 5, and 6.

In comparison-group evaluations, switching from one standardized
test to another is acceptable if both tests meet the requirements of this
guidebook. The result is usually to lower pretest-posttest correlations 
and correspondingly to lower precision of the evaluation. Switching to 
an entirely dissimilar test is strongly discouraged. 
Hazard 12



The use of inappropriate formulas to estimate posttest scores. 



Under certain circumstances, it makes sense to expect that a pupil will 
maintain his relative status with respect to national norms from pre- to 
posttest if he does not participate in a special project. However, many 
methods have been devised for calculating performance level expectations 
which rest on clearly untenable assumptions. These methods of estimating 
performance levels should never be used. 



Why is this a hazard?

Many projects use an unrealistic theoretical model or formula to 
calculate "expected" posttest scores from IQ or other pretest scores. If 
students do better than the calculated expectation, the project is con- 
sidered a success. Estimated posttest scores are often based on average 
grade-equivalent scores. For example, a student who has gained 0.7 years 
per year, on the average, since beginning school is presumed to continue 
at the same rate unless a special program increases his rate. There are 
many problems with such an estimate, but the major one is in the use of 
grade-equivalent scores (see Hazard 1). The student who averaged 0.7
years per year over several years will usually appear to gain more than 
that if measured from fall to spring, giving a misleading impression of 
improvement. 

Most IQ-based estimates are both inaccurate and logically unreas- 
onable. For example, the Bond-Tinker formula (Della-Piana, 1968, p. 41) 
is often used to compute an "expected" reading level, i.e., 

    Expected reading level = (IQ / 100) x (No. of years in school) + 1.
For a student with an IQ score of 85 (approximately one standard deviation
below the mean) at grade level 7.1 (6.1 years of school completed):

    Expected reading level = (.85) x (6.1) + 1 = 6.2

So the formula says he should be reading at the sixth-grade level. But 
since his IQ is supposed to be "mental age" divided by "chronological age," 
his mental age would be given by: 

    MA = (IQ / 100) x (CA).

Assuming the seventh-grader is twelve years old:

    MA = (.85) x (12) = 10.2, or about 10 years.

We now have a twelve-year-old student with a mental age of ten years who
is expected to read as well as an average sixth grader (11 years old).
This is certainly inconsistent, but even worse, it is incorrect. According
to normative data from the Gates-MacGinitie reading test, a seventh-grade
student one standard deviation below the mean is reading at the
fourth-grade level.
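
The arithmetic in this example can be checked with a short script. The
function names are hypothetical labels for the two formulas quoted above;
the inputs are the ones used in the text.

```python
def bond_tinker_expected_level(iq, years_in_school):
    """Expected reading level = (IQ / 100) x (years in school) + 1."""
    return (iq / 100) * years_in_school + 1

def mental_age(iq, chronological_age):
    """Mental age implied by a ratio IQ: MA = (IQ / 100) x CA."""
    return (iq / 100) * chronological_age

expected = bond_tinker_expected_level(iq=85, years_in_school=6.1)
ma = mental_age(iq=85, chronological_age=12)

print(f"expected reading level: {expected:.1f}")   # grade level
print(f"implied mental age:     {ma:.1f} years")
```

The script reproduces the two figures the argument turns on: a predicted
reading level of about 6.2 alongside an implied mental age of about ten
years.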

Because of these and many other theoretical and practical problems, 
the underlying concepts of the intelligence quotient have been abandoned 
by measurement specialists (Cronbach, 1970, p. 216; Tyler, 1972, p. 177). 
While the commercial instruments which have been designed as "IQ tests" 
may have a variety of practical uses, they are not, in general, the best 
available predictors of specific school skills. Therefore, IQ scores 
are not recommended for any purpose in evaluating the effects of special 
projects.

How can the hazard be avoided? 

In norm-referenced evaluation models, posttest scores can be
estimated by referring to national norms. When comparison groups are used,
the actual posttest scores of these groups, or a regression equation esti- 
mating the posttest scores, provide the proper basis for evaluating treat- 
ment effects. 
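
As a rough illustration of the regression-equation approach, the sketch
below fits an ordinary least-squares line to comparison-group scores and
uses it to estimate what treatment pupils would have scored without the
project. All scores are invented for the example, and the procedure shown
is a simplification, not the full analysis described in the model
chapters.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical comparison-group scores (no special treatment).
comp_pre = [20, 25, 30, 35, 40]
comp_post = [26, 30, 36, 40, 46]

# Hypothetical treatment-group scores.
treat_pre = [22, 28, 33]
treat_post = [33, 38, 44]

slope, intercept = fit_line(comp_pre, comp_post)

# Estimated no-treatment posttest score for each treatment pupil.
expected_post = [intercept + slope * p for p in treat_pre]

# Treatment effect estimate: observed minus expected, averaged over pupils.
mean_effect = mean(obs - exp for obs, exp in zip(treat_post, expected_post))
print(f"slope={slope:.2f}, intercept={intercept:.2f}, mean effect={mean_effect:.2f}")
```

The expectation here comes from pupils who were actually tested under
non-treatment conditions, not from an IQ-based formula.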
III. A PROCEDURAL GUIDE FOR MODEL SELECTION



This section presents a procedural guide for selecting an evaluation 
model. By answering a series of questions relating to the real-world con- 
straints under which the evaluation will be conducted, the reader is led 
to one of the five evaluation models presented in the following chapter. 

Figure 1 on page 47 summarizes the seven-step decision tree in
flow-diagram form. Each step is discussed separately on the pages preceding
Figure 1. (This page arrangement is intended to facilitate reference to 
the fold-out figure.) For each step, the decision question is presented 
with two answer alternatives. A "comment" section is also included which 
explains the issue in question and the implications of choosing each alter- 
native course of action. 

The specific path to be followed through the decision tree depends
on the answers the reader makes to each of the seven questions, and
instructions on how to proceed are provided for each answer alternative.
The reader should first read through the chapter and then make a selection 
by skipping from page to page in accordance with these instructions. 

Figure 1 also shows the five evaluation models which are discussed 
in Chapter IV. They are arranged in decreasing order of scientific rigor, 
with those at the top of the page enabling the evaluator to draw substan- 
tially more conclusive inferences about project impact than those at the 
bottom. On the other hand, the feasibility of implementation is expected 
to operate in exactly the opposite direction so that the less rigorous 
models will be much easier to use. While the more rigorous models are 
certainly to be preferred, any one of the five will yield believable re- 
sults if carefully implemented. 






Question 1 

Do practical considerations (policy, availability, cost, time) 
permit you to select an evaluation design which makes use 
of a local comparison group? 

Yes Proceed to Question 2 
No Go to Model 5, page 72 

Comment 

In order to measure the impact of any special instructional
treatment, it is essential to have some estimate of how
the participants would have fared under normal or
non-treatment conditions. Since, presumably, the non-treatment
condition consists of participation in a regular school
curriculum, some gains would clearly be expected even
without the special project. The problem is to obtain a good
estimate of how large the pupils' gains would have been
under such conditions and subtract this estimate from the 
gains they actually obtained in the special project. The 
difference is the incremental gain which can be attributed 
to project participation. 

There are two kinds of local comparison groups which can 
provide adequate estimates of non-treatment expectations:
(a) a conventional comparison group which is like the
treatment group in all educationally relevant respects, and (b)
a comparison group which results from splitting an
intact group into treatment and comparison subgroups at 
some pretest cutoff score. 

The best method of estimating non-treatment posttest scores 
is to find a group of pupils exactly like the project 
children and to treat them in exactly the same way with
the single exception of withholding the special treatment
from them. Their posttest scores will then constitute the
best possible estimate of how well the treatment group would 
have done without the treatment. 
It is often not possible to obtain a sample of exactly
comparable pupils to serve as a comparison group. Under
appropriate conditions, however, groups which are not
strictly comparable can be used for estimating non-treat- 
ment performance. Model 3, in fact, divides a class or 
other pre-existing group into treatment and control sub- 
groups at some pretest cutoff score so that all pupils 
above the cutoff go into one group while all pupils below 
it go into the other group. 

The issue of comparison group suitability and the implications
of the type of group for the selection of an evaluation
design are addressed in subsequent Questions. If
either type of local comparison group is available, proceed 
to Question 2. 

Where no local comparison group is available, the evalua- 
tion must depend on comparisons between treatment-student 
scores and national norm-group data collected by the pub- 
lishers of standardized tests. This procedure is explained 
in Model 5, page 72. 
Question 2

Will intact groups or individual pupils be assigned to
treatment and comparison conditions?

Groups Skip to Question 5
Pupils Proceed to Question 3

Comment



The most commonly encountered type of intact group is a
classroom, a school, or a grade level within a school.
Assignment by groups would mean that one third-grade classroom
was assigned to the treatment condition and another to the
comparison condition, or that all third graders in one
school comprised the treatment group while all third graders
in another school constituted the comparison group. Third
graders from one school who were in the lowest quartile of
the national distribution in reading could also be considered
a pre-existing group if they were compared against similar
children from another school. In all of these cases, the
condition to which the pupils were assigned was determined
entirely by their group membership without regard to any
characteristics of the individuals.

On the other hand, if all third graders were listed
alphabetically and alternately assigned to treatment and
comparison conditions, we would say that assignment was by
individual pupils. Another similar example would entail the pairing
of children on the basis of their pretest scores with
subsequent assignment of one member of each pair to the
treatment group and the other to the comparison group.

A quite different kind of assignment, but one still considered
assignment by pupils, involves the assignment of pupils who
score below some selected cutoff point on a test to the
treatment group and those scoring above that point to the
comparison group. In this case, some members of an intact
group were assigned to the treatment condition and others
to the comparison condition, but it should be clear that
assignment to conditions was based on considerations
relating to individual characteristics and not to group
membership.

Assignment by pupil is generally preferred over assignment
by group as this method offers greater control over
potentially biasing factors. The use of pre-existing groups is
a viable alternative only where the groups are similar in
all relevant respects to groups which would have resulted
from assignment by pupil. 






Question 3

Is it possible to assign pupils randomly to treatment and
comparison groups, or will group membership be determined
by need?

Randomly Proceed to Question 4
By Need Skip to Question 6

Comment



Random assignment implies that each child in a single "pool"
or group has an equal chance of being assigned to the
comparison or to the treatment group. One way to accomplish random
assignment would be to place the names of all the children
in a hat and then draw them out one at a time, assigning
every other child to the treatment group. There are other
techniques which are equally suitable, but the decision as
to whether a child is assigned to one group or the other
must be left purely to chance. Group assignment based on
teacher preferences, children volunteering, or similar
human actions is not random. To consider it so may be
seriously misleading (see Hazard 6, p. 19).
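
The names-in-a-hat procedure translates directly into code. In this
sketch the pool of names and the function name are hypothetical; the
shuffle plays the role of drawing names from the hat.

```python
import random

def draw_from_hat(names, rng):
    """Shuffle the pool, then assign every other drawn name to the treatment group."""
    hat = list(names)
    rng.shuffle(hat)            # chance alone fixes the drawing order
    treatment = hat[0::2]       # 1st, 3rd, 5th ... name drawn
    comparison = hat[1::2]      # 2nd, 4th, 6th ... name drawn
    return treatment, comparison

pool = ["Ann", "Ben", "Cal", "Dee", "Eve", "Fay", "Gus", "Hal"]
treatment, comparison = draw_from_hat(pool, random.Random())
print("treatment: ", sorted(treatment))
print("comparison:", sorted(comparison))
```

No teacher preference, volunteering, or other human judgment enters the
procedure at any point, which is what makes the assignment random.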

The assumption of random assignment underlies most
statistical tests. A statistically significant t or F test means
simply that the observed difference between groups was larger
than would normally be expected to result from random
assignment. This, in turn, implies that if assignment was random,
the observed difference was probably due to the treatment.
If assignment was not random, however, a "statistically
significant difference," by itself, is generally meaningless.
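
The logic of this paragraph can be made explicit with an exact
randomization (permutation) test, which is used here in place of the t or
F test named in the text because it shows the reasoning directly: it
counts how many of the possible random assignments would have produced a
difference at least as large as the one observed. The scores below are
hypothetical.

```python
from itertools import combinations
from statistics import mean

def randomization_p_value(treatment, comparison):
    """One-sided exact p-value: share of all possible random assignments whose
    treatment-minus-comparison mean difference is at least the observed one."""
    scores = treatment + comparison
    n_treat = len(treatment)
    n_comp = len(scores) - n_treat
    observed = mean(treatment) - mean(comparison)
    total = sum(scores)
    as_large = 0
    n_splits = 0
    for idx in combinations(range(len(scores)), n_treat):
        t_sum = sum(scores[i] for i in idx)
        diff = t_sum / n_treat - (total - t_sum) / n_comp
        n_splits += 1
        if diff >= observed - 1e-12:   # tolerance for floating-point ties
            as_large += 1
    return as_large / n_splits

# Hypothetical posttest scores for two small, randomly assigned groups.
treatment = [60, 62, 65, 70]
comparison = [50, 52, 55, 58]
p = randomization_p_value(treatment, comparison)
print(f"one-sided p = {p:.4f}")
```

With these scores only one of the 70 possible assignments gives a
difference as large as the observed one. That small probability is
meaningful only because the comparison is against random assignments; if
the groups were formed some other way, the count says nothing about the
treatment.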

Special projects are most often designed to serve particular
segments of the population (e.g., disadvantaged, gifted,
bilingual). Under certain circumstances, children in such
categories can be selected from a heterogeneous group and
given a special treatment while the remaining children
serve as a useful and valid comparison group. Questions
6 and 7 describe the conditions and procedures for
implementing evaluation models of this type.
Question 4

Is it possible to match pupils on the basis of pretest
scores before randomly assigning one member of each pair
to the treatment group and the other to the comparison group?

Yes Go to Model 1, page 49
No Go to Model 2, page 54

Comment



Random assignment usually results in some small differences
between groups in terms of pretest performance. At least
some of this difference can be expected to carry over to
posttest performance. For this reason, it is desirable to
remove these differences, however small, either by
pre-assignment matching (see Model 1, page 49) or by
statistical manipulation after the fact using analysis of
covariance (see Model 2, page 54). Pre-assignment matching
is the preferred technique if feasible and has the
additional advantage of minimizing computational complexity, a
significant drawback of covariance analysis techniques.

Matching must be accomplished before pupils are assigned 
to groups. The correct procedure is to identify pairs of 
students having equal or essentially equal scores on some 
test known to correlate highly with the post-treatment 
measure. One member of each pair is then assigned to either 
the treatment or the comparison group based on the outcome 
of some random event such as the flip of a coin. The re- 
maining member of each pair is assigned to the other group. 

One of the most common errors in educational evaluation is 
that of matching after assignment. If, for example, there
are two pre-existing groups, it is common to administer the 
treatment to one of them while selecting pupils with matching 
pretest scores from the other to serve as a comparison group. 
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Although coiODion, this procedure Is fundamentally unsound 
and Introduces systematic biases into the data. Unless 
matching can be accomplished prior to assignment, it should 
not be done at all. (See Hazard 8, page 23.) 
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Question 5 



Where a pre-existing comparison group is available, is it 
sufficiently similar to the treatment group so that the 
assignment of pupils to groups can be considered to have 
been "random in effect?" 

Yes Go to Model 2, page 54 
No Skip to Question 7 

Comment 

As discussed in the Comment accompanying Question 3,
statistical tests of the difference between the means of
two groups generally rest on the assumption that group
membership was determined through random assignment
processes. It is possible, of course, for no educationally
relevant differences to exist between two classrooms of
third graders in a particular school, or between grade-level
peers in two schools in a district. Under these
circumstances, the groups are virtually identical to groups
which would have resulted from random assignment, and their
composition may be considered random in effect (Lord, 1967,
p. 38).

Where pre-existing, intact groups are used as treatment
and comparison groups, it is not appropriate to assume that
they are adequately similar. This possibility must be
investigated empirically, and the onus of proof is on the
evaluator. Ideally, the process by which students were
assigned to the two groups should have been effectively
random. At the very least, the two groups must not be
significantly different in terms of pretest scores. They
must also be comparable in terms of socioeconomic status,
age, sex, and racial composition. School size and setting
(urban or rural) as well as neighborhood should also be
comparable. Even when these factors are alike, serious
biases are possible. Such biases are introduced when teacher
or student participation is voluntary or when the designation
of treatment or comparison group is made by principals or
teachers. This guidebook discourages any use of local
comparison groups which are clearly dissimilar to the
treatment group (see Hazard 6).

Question 6



Is assignment to the treatment or comparison group based
on a cutoff value on some pre-treatment measure or
combination of measures?

Yes Go to Model 3, page 59 
No Proceed to Question 7 

Comment 

Where the memberships of the treatment and comparison
groups are neither random nor random in effect, so-called
"true" experimental designs can no longer be used. Under
these circumstances "quasi-experimental" evaluation models
must be employed.

There are two quasi-experimental evaluation models (the
Special Regression Models) which can, under certain
circumstances, provide acceptably conclusive evidence
regarding treatment impact in situations where pupil
assignment to treatment and comparison groups is based on
need rather than randomization. Both of these models,
however, require the establishment of a cutoff score above
which all pupils are assigned to one group and below which
all pupils are assigned to the other. Numerical ratings
by teachers, classroom grades, and standardized achievement
test scores may be used singly or in any desired
combination, but there must be a single cutoff score.

Other models exist which do not require assignment to
treatment and control conditions based on a single cutoff score.
As design requirements of this type are relaxed, however,
additional assumptions must be made in order to attribute
the cause of observed between-group differences to
treatment influences, and credibility is thus diminished. These
models are treated in Question 7.

Question 7



Is there a pre-existing comparison group whose performance
on the pretest measure is superior to the performance of
the treatment group?

Yes Go to Model 4, page 70
No Go to Model 5, page 72

Comment

Quasi-experimental designs all rest on sets of assumptions
having varying degrees of plausibility. One such assumption
which is relevant here and appears "safe" is that a group
which is initially superior to another group in cognitive
development will continue to grow at a rate equal to or
greater than that of the initially inferior group, other
things being equal. If, under these circumstances, the
initially inferior group outperforms the initially superior
group after participation in a special instructional
treatment, it is probably safe to conclude that the treatment
was effective. On the other hand, if an initially inferior
group receives the treatment but fails to surpass the
comparison group on the posttest (a typical situation), it is
difficult to draw conclusions with confidence. Under certain
conditions regression models not requiring single cutoff
scores may be applicable (see Model 4). Finally, if the
treatment was administered to the initially superior group
and its posttest performance remained superior to that of the
comparison group, it would be difficult to decide whether the
superior posttest performance resulted from the treatment
or simply from the inherent superiority of the treatment
group.

If the only available comparison group scores significantly
lower on the pretest than the treatment group, the
information obtainable from it is usually not worth the time and
expense to collect. A norm-referenced evaluation model
will probably be more useful and will certainly be less
costly (see Model 5).

IV. EVALUATION MODELS



This section of the guidebook provides descriptive information
about five evaluation models suitable for use in assessing the cognitive
benefits resulting from local school projects. They are not necessarily
the only models suitable for this purpose but they are recommended as the
most convincing models that can be feasibly implemented given the
constraints of operating school systems.

The five models are: 

1. Posttest Comparison with Matched Groups (p. 49)
2. Analysis of Covariance (p. 54)
3. Special Regression (p. 59)
4. Generalized Regression (p. 70)
5. Norm-referenced (p. 72)

Each of these models is described in terms of general characteristics, 
strengths and weaknesses, and considerations related to its implementation. 
Except where computational procedures are excessively complex and require 
the skills of a sophisticated statistician (the Generalized Regression 
Model), step-by-step procedures are provided for using each of the models. 
References to sources of more detailed information are also included. 

Each of the evaluation models in this section has specific analysis
requirements. However, several preliminary steps are useful with any
evaluation model. These preliminary steps are discussed in Chapter VI.

Model 1



Posttest Comparison with Matched Groups 

Summary

General Characteristics. This model requires that children be paired
in terms of pretest measures and that one member of each pair be randomly
assigned to the treatment group and the other to the comparison group.

Strengths. The matched groups evaluation model provides what is
theoretically the most accurate estimate of how the treatment group would
have done had they not received the special instructional treatment. This
high degree of accuracy is due to the fact that the comparison group is
constructed so as to be virtually identical to the treatment group at
pretest time. Thus, if the experiences of the two groups are the same between
pre- and posttest with the single exception of exposure to the treatment,
the comparison group should achieve posttest scores which are essentially
the same as those which would have been achieved by the treatment group had
its members not received the treatment.

Weaknesses. The manner of assigning pupils to treatment and control
groups employed in this model may produce a greater awareness of group
membership than other, less obtrusive assignment procedures. Children in the
comparison group may realize that their group is not inherently different
from the treatment group, yet, for some reason, the other group of children
is receiving special attention. This increased awareness of group
membership may magnify such spurious influences as the Hawthorne effect in the
treatment group or the John Henry effect in the comparison group (see
Hazard 10).

Implementation Considerations. This evaluation model allows a wide
choice of test instruments and testing times. However, if, as would be
recommended, a norm-referenced comparison is also employed, the choice of
tests and testing times becomes more restricted (see the Norm-referenced
Model, page 72).

According to the Matched Groups Model, children in the treatment
and comparison conditions are matched on the basis of their pretest scores
and possibly other educationally relevant variables as well. At posttest
time, it is important to use an instrument that measures the same skills
as the pretest since matching on an irrelevant variable would not reduce
the experimental error naturally associated with random assignment
processes. The precision gained by matching is, in fact, directly related to
the correlation between pre- and posttest scores.

In order to implement the model, it must be possible to (a) pretest
a group of children large enough in number to form both a treatment and a
comparison group, (b) pair children on the basis of their pretest scores,
and (c) randomly assign one member of each pair to the treatment group and
the other to the comparison group. If eligibility for participation in
the treatment group is based on some special instructional need and all
children can be served, this procedure is clearly not appropriate, and one
of the other evaluation models should be implemented.

Implementation Procedures.

Step One: Identify a group of potential participants large
enough in number to form both a treatment and a comparison group.

Step Two: Administer the pretest to the entire group with an
instrument known to correlate highly with the measure selected
for use as the posttest.

Step Three: Score the pretest. Using their raw scores, identify 
pairs of children with identical or nearly identical test scores. 
Note: Unless pairings are based on identical scores, there is a 
possibility that the mean pretest scores of the treatment and 
comparison groups may differ by amounts large enough to influence 
the evaluation outcome. If such differences are found, covariance 
analysis should be used to adjust for them (see Model 2). 

Step Four: Once children are paired, randomly assign one member
of each pair to the treatment group and the other to the comparison
group. Randomization may be done by flipping a coin, using a
table of random numbers, or any other procedure based on chance
rather than choice.
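
Steps Three and Four can be sketched in a few lines of Python. The
sketch below is illustrative only and is not part of the guidebook's
procedure: the function name match_and_assign, the max_gap tolerance,
and the greedy pairing of adjacent scores are all assumptions made for
the example. The essential point it preserves is that pairing happens
first and chance alone then decides which member of each pair is treated.

```python
import random


def match_and_assign(pretest_scores, max_gap=0, seed=None):
    """Pair pupils on pretest score, then randomly split each pair.

    pretest_scores: dict mapping pupil id -> raw pretest score.
    max_gap: largest score difference tolerated within a pair
             (hypothetical parameter; 0 means identical scores only).
    Returns (pairs, unmatched); each pair is (treatment_id, comparison_id).
    """
    rng = random.Random(seed)
    # Sorting puts the closest scores next to each other.
    ordered = sorted(pretest_scores, key=lambda pid: pretest_scores[pid])
    pairs, unmatched = [], []
    i = 0
    while i < len(ordered):
        if i + 1 < len(ordered):
            a, b = ordered[i], ordered[i + 1]
            if abs(pretest_scores[a] - pretest_scores[b]) <= max_gap:
                # The "coin flip": chance, not choice, picks the treated pupil.
                if rng.random() < 0.5:
                    a, b = b, a
                pairs.append((a, b))
                i += 2
                continue
        # No partner within the tolerance; this pupil stays out of the study.
        unmatched.append(ordered[i])
        i += 1
    return pairs, unmatched


if __name__ == "__main__":
    scores = {"A": 10, "B": 10, "C": 12, "D": 13, "E": 20}
    pairs, unmatched = match_and_assign(scores, max_gap=1, seed=1)
    print("pairs (treatment, comparison):", pairs)
    print("unmatched:", unmatched)
```

Pupils left unmatched are simply excluded from the matched analysis. If
max_gap is set above zero, the caution in Step Three applies: group means
on the pretest should be compared before the posttest results are
interpreted.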

Step Five: Once the groups are formed, it is important to monitor
their experiences over the treatment period. The experiences of
the two groups should be identical with the single exception that
one group gets the treatment and the other does not. Where this
is not the case, the differences between groups in posttest
performance may not be the result of the treatment, but rather a result
of uncontrolled attitudinal or experiential factors (see Hazard 10).

Step Six: Administer the posttest. If at all possible, the two
groups should be tested at the same time. Large differences in
testing times allow potentially relevant experiences to occur for
one group and not the other. Even small differences such as the
time of day, the weather, the emotional climate, and other
difficult-to-assess influences may alter test performances.

Step Seven: Score the posttest. Raw scores should be converted
to their standard or scale score equivalents before any computations
are undertaken. If a test scoring service is used, it should be
made clear that each raw score should be converted to its standard
or scale score equivalent.

Step Eight: Compute the following summary statistics by obtaining 
the indicated formulas from any elementary statistics book. 

(a) The mean and standard deviation of posttest scores for the 
treatment group. 

(b) The mean and standard deviation of posttest scores for the 
comparison group. 

(c) The correlation between groups based on the original pairing 
of children. 

Note that if one member of a pair is lost, i.e., no posttest score
is obtained, the other member must be excluded from all of these
calculations. See Chapter VI for further analysis considerations.

Step Nine: Compare the mean posttest scores of the treatment and
comparison groups. If the treatment group score is greater than
the comparison group score, the project may have been effective.
The statistical significance of the difference should be checked
using the following t test for paired observations:

$$t = \frac{\bar{Y}_t - \bar{Y}_c}{\sqrt{\dfrac{s_t^2 + s_c^2 - 2 r_{tc} s_t s_c}{N - 1}}}$$

where $\bar{Y}_t$ = posttest mean standard score of the treatment
group

$\bar{Y}_c$ = posttest mean standard score of the comparison
group

$s_t$ = standard deviation of the treatment group
posttest scores

$s_c$ = standard deviation of the comparison group
posttest scores

$r_{tc}$ = correlation between posttest scores of the two
groups

$N$ = number of pairs of children

Degrees of freedom = $N - 1$

The one-tailed probability of the computed t can be found in the
tables provided in most standard statistical texts. If it is less
than or equal to .05 (p ≤ .05), the special project may be said
to have produced statistically significant achievement gains.
There is no generally accepted criterion for deciding whether
the size of the gain is large enough to be considered educationally
significant. Where standardized tests are used, the standard
deviation of the national norm group ($\sigma$) provides a useful reference.
As a rule of thumb, the authors suggest that if the observed posttest
scores exceed the no-treatment expectation by one-third of a
standard deviation, the treatment effect be considered
educationally significant. In other words, if

$$\bar{Y}_t - \bar{Y}_c \geq \sigma/3$$

the gain may be considered educationally significant.
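
The computations of Steps Eight and Nine can be sketched as follows.
This is a minimal illustration, not the guidebook's worksheet. It
assumes that the standard deviations are computed with divisor N (which
is what makes the separate division by N - 1 in the paired t formula
come out right), and the function names are hypothetical. Obtaining the
one-tailed probability still requires a t table or a statistics package.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sd_n(xs):
    """Standard deviation with divisor N (an assumption; see lead-in)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(xs, ys):
    """Correlation between the paired scores of the two groups."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd_n(xs) * sd_n(ys))


def paired_t(treatment, comparison):
    """Paired t from group summary statistics.

    treatment[i] and comparison[i] must be the posttest standard scores
    of the two members of pair i; incomplete pairs must be dropped first.
    Returns (t, degrees_of_freedom).
    """
    n = len(treatment)
    s_t, s_c = sd_n(treatment), sd_n(comparison)
    r_tc = pearson_r(treatment, comparison)
    var_diff = s_t ** 2 + s_c ** 2 - 2 * r_tc * s_t * s_c
    t = (mean(treatment) - mean(comparison)) / math.sqrt(var_diff / (n - 1))
    return t, n - 1


def educationally_significant(mean_t, mean_c, norm_sd):
    """Rule of thumb: gain of at least one-third of the norm-group SD."""
    return (mean_t - mean_c) >= norm_sd / 3


if __name__ == "__main__":
    t_scores = [52, 48, 55, 60, 47, 58]   # made-up illustrative data
    c_scores = [50, 45, 51, 57, 48, 52]
    t, df = paired_t(t_scores, c_scores)
    print(f"t = {t:.3f} with {df} degrees of freedom")
```

Working from the two standard deviations and the correlation gives the
same t as working directly from the within-pair differences; the
summary-statistic form is shown because it is the one given in Step Nine.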
Model 2



Analysis of Covariance 

Summary

General Characteristics. This model is appropriate to use where
individual pupils are randomly assigned to treatment and comparison groups
or where pre-existing groups which are sufficiently similar (i.e., can be
considered random samples from a single population) are assigned to
treatment and comparison conditions. Analysis of covariance provides an
appropriate statistical adjustment to compensate for pretest score differences
between groups if these differences were due to such chance factors as
random sampling fluctuations. If pretest differences are real, however
(that is, the treatment and comparison groups cannot be regarded as random
samples from a single population), covariance analysis systematically
underadjusts for the initial differences between groups. The
underadjustment produces inaccurate posttest expectations, thereby preventing
meaningful comparisons between groups.

Strengths. When between-group differences are evident despite the
random assignment, analysis of covariance provides the best method of
adjusting observed posttest scores for random pretest differences.
Comparing posttest means that have been adjusted is always more precise than
comparing unadjusted posttest scores.

Weaknesses. This model assumes that treatment and comparison students
are random samples from a single population so that any difference in
pretest performance is due only to sampling error and random error of
measurement. It will not provide an appropriate adjustment for pretest score
differences which reflect non-random differences between groups (see
Hazard 6, page 19). Where analysis of covariance is employed with data
from pre-existing, intact groups, there is always some danger in presuming
that the groups are random samples from a single population.

Implementation Considerations. This evaluation model allows a wide
choice in the test instruments to be used and in the time of testing. If,
as would be recommended, a norm-referenced comparison is also made, the
choices become more restricted (see the Norm-referenced Model, page 72).

The control group must be very similar to the treatment group.
True random selection is strongly advised. If the groups were not selected
randomly, strong evidence is needed to demonstrate that the selection was
"random in effect" (see Chapter III, Question 5).

This model involves extensive computations and, unless they can be
done at little cost or effort on a computer, a decision should be made as
to whether the analysis is justified. The degree of precision gained by
employing analysis of covariance depends in part on the correlation between
the pretest and the posttest. If the correlation is relatively low, the
adjusted values would not differ very much from the unadjusted values; if
high, then the posttest means would be adjusted by a correspondingly high
proportion of the original pretest difference. Pre- and posttest measures,
consequently, should be selected to maximize the correlation between them.
Multiple covariates may be used to achieve this objective.
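
The dependence of the adjustment on the pre-post relationship can be
illustrated with a short sketch. It uses the textbook form of the
single-covariate adjustment, in which the posttest difference is reduced
by the pooled within-group regression slope times the pretest
difference. This form is an assumption made for illustration; it is not
taken from the Appendix B worksheets, and it yields only the adjusted
difference, not its significance test. The function name is hypothetical.

```python
def adjusted_posttest_difference(x_t, y_t, x_c, y_c):
    """Covariance-adjusted difference between posttest means.

    x_t, y_t: pretest and posttest scores of the treatment group.
    x_c, y_c: pretest and posttest scores of the comparison group.
    Returns (adjusted_difference, pooled_within_group_slope).
    """
    def mean(v):
        return sum(v) / len(v)

    def within_sums(x, y):
        # Sum of cross-products and sum of squares about the group's own means.
        mx, my = mean(x), mean(y)
        sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
        ss = sum((a - mx) ** 2 for a in x)
        return sp, ss

    sp_t, ss_t = within_sums(x_t, y_t)
    sp_c, ss_c = within_sums(x_c, y_c)
    slope = (sp_t + sp_c) / (ss_t + ss_c)

    raw_difference = mean(y_t) - mean(y_c)
    pretest_difference = mean(x_t) - mean(x_c)
    return raw_difference - slope * pretest_difference, slope


if __name__ == "__main__":
    # Made-up data: the treatment group starts 4 points lower on the pretest.
    x_t, y_t = [40, 50, 60], [45, 55, 65]
    x_c, y_c = [44, 54, 64], [44, 54, 64]
    adjusted, slope = adjusted_posttest_difference(x_t, y_t, x_c, y_c)
    print(f"slope = {slope:.2f}, adjusted difference = {adjusted:.2f}")
```

In the made-up data the pre-post relationship is perfect, so the whole
pretest difference is carried into the adjustment. With a weaker
relationship the slope shrinks and the adjusted difference moves back
toward the unadjusted one.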

Implementation Procedures.

Step One: Form the treatment and comparison groups. Assignment
to groups should be based on a random procedure such as drawing
well-shuffled names from a hat. In some cases, intact classrooms
may represent a reasonable approximation to randomly selected
groups. Groups differing systematically on ethnicity, SES, sex,
or other obvious variables are never satisfactory. Similarly,
a non-volunteer group can never serve as a comparison group for
volunteers.

Step Two: Administer and score the pretest. Testing conditions
must be exactly alike for the treatment and comparison groups.
Testing both groups together may be a good idea unless one group
of students is put at a relative disadvantage, e.g., by being
tested in unfamiliar surroundings.

Step Three: At the end of the project, administer and score the
posttest. Once again, testing conditions for the two groups should
be exactly alike. Raw scores should be converted to their standard
or scale score equivalents before any computations are undertaken.
If a test scoring service is used, it should be made clear that
each raw score should be converted to its standard or scale score
equivalent.

Step Four: If there is no difference between the groups on the
pretest, analysis of covariance is not needed. In this case, a
simple t test for independent groups is appropriate for testing the
posttest difference:

$$t = \frac{\bar{Y}_t - \bar{Y}_c}{\sqrt{\left(\dfrac{N_t s_t^2 + N_c s_c^2}{N_t + N_c - 2}\right)\left(\dfrac{N_t + N_c}{N_t N_c}\right)}}$$

where $\bar{Y}_t$ = mean standard score of the treatment group
on the posttest

$\bar{Y}_c$ = mean standard score of the comparison group
on the posttest

$s_t$ = standard deviation of the treatment group
posttest scores

$s_c$ = standard deviation of the comparison group
posttest scores

$N_t$ = number of treatment group pupils

$N_c$ = number of comparison group pupils

Degrees of freedom = $N_t + N_c - 2$

The one-tailed probability of the computed t can be found in the
tables provided in most standard statistical texts. If it is less
than or equal to .05 (p ≤ .05), the project may be said to have
produced statistically significant achievement gains.
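
The independent-groups t test of Step Four needs only six summary
values. The sketch below is illustrative; the function name is
hypothetical, and the standard deviations are assumed to be computed
with divisor N, which is the form that matches the pooled-variance term
in the formula above.

```python
import math


def independent_t(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """t test for independent groups from summary statistics.

    sd_t and sd_c are assumed to use divisor N (see lead-in).
    Returns (t, degrees_of_freedom).
    """
    pooled_variance = (n_t * sd_t ** 2 + n_c * sd_c ** 2) / (n_t + n_c - 2)
    standard_error = math.sqrt(pooled_variance * (n_t + n_c) / (n_t * n_c))
    return (mean_t - mean_c) / standard_error, n_t + n_c - 2


if __name__ == "__main__":
    # Made-up values: a 4-point posttest advantage with 25 pupils per group.
    t, df = independent_t(54, 50, 10, 10, 25, 25)
    print(f"t = {t:.3f} with {df} degrees of freedom")
```

The resulting t is then referred to a one-tailed t table with the
stated degrees of freedom, as described in Step Four.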

Step Five: Assuming the groups differed in mean pretest scores,
an analysis of covariance is recommended. McNemar (1969, Ch. 18)
provides a readable explanation of the model. A more complete
development is available in Winer (1971, Ch. 10). Because of the
amount of computation involved, the use of a computer is highly
desirable. Appropriate programs can be provided by most computer
centers. Where the amount of data is small or computer facilities
are unavailable, the calculations can be done by hand. Instructions
for carrying out the analysis of covariance and a set of worksheets
for simplifying the computational work are included in Appendix B.
These worksheets are referenced directly to the numerical example
in Winer (1971) and preserve his notation, but are revised for the
case of two groups (treatment plus comparison). Since the textbook
examples are for three groups, they are not directly applicable to
the typical project evaluation.

Before undertaking a hand-calculated analysis of covariance, it is
advisable to do a quick check to see whether the effort is justified.
Analysis of covariance is essentially the same as the above t test,
but with the posttest difference ($\bar{Y}_t - \bar{Y}_c$) adjusted to take into
account differences between the groups at pretest time ($\bar{X}_t - \bar{X}_c$).
If the correlation between pretest and posttest is 1.0, the entire
difference ($\bar{X}_t - \bar{X}_c$) is added to the posttest score of the group
which was lower at pretest time. This is the maximum possible
adjustment. Since, in practice, the correlation will be less than
one, the adjustment will be somewhat smaller. To check whether
the adjustment is likely to affect conclusions: (a) test the
unadjusted posttest difference ($\bar{Y}_t - \bar{Y}_c$) using a t test, and (b) test
the posttest difference with the maximum adjustment using a t
test. If both t tests are significant, then analysis of covariance
will also be significant and need not be computed. If both are
non-significant, analysis of covariance will also be non-significant
and need not be computed. It is only necessary to carry out the
analysis of covariance if one t test is significant and the other
is not. The two t tests are as follows:

(a) No adjustment:

Use the formula for t given on page 56 exactly as written.

(b) Maximum adjustment:

Use the formula for t given on page 56 but change the
numerator to

$$(\bar{Y}_t - \bar{Y}_c) - (\bar{X}_t - \bar{X}_c)$$

where $\bar{X}_t$ = mean standard score of the treatment
group on the pretest

$\bar{X}_c$ = mean standard score of the comparison
group on the pretest.
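
The quick check amounts to running the Step Four t test twice with two
different numerators. A minimal sketch follows. The function names and
the returned labels are hypothetical, the standard deviations are
assumed to use divisor N, and the critical value must be looked up by
the user in a one-tailed t table for the appropriate degrees of freedom
(the value in the example is an approximate table entry).

```python
import math


def t_for_difference(difference, sd_t, sd_c, n_t, n_c):
    """Step Four t statistic for a given numerator."""
    pooled_variance = (n_t * sd_t ** 2 + n_c * sd_c ** 2) / (n_t + n_c - 2)
    return difference / math.sqrt(pooled_variance * (n_t + n_c) / (n_t * n_c))


def ancova_quick_check(y_t, y_c, x_t, x_c, sd_t, sd_c, n_t, n_c, t_critical):
    """Decide whether a full analysis of covariance is worth computing.

    y_t, y_c: posttest means; x_t, x_c: pretest means (standard scores).
    sd_t, sd_c: posttest standard deviations; n_t, n_c: group sizes.
    t_critical: one-tailed critical t supplied by the user from a table.
    Returns (verdict, t_no_adjustment, t_maximum_adjustment).
    """
    t_none = t_for_difference(y_t - y_c, sd_t, sd_c, n_t, n_c)
    # Maximum adjustment: the whole pretest difference is removed.
    t_max = t_for_difference((y_t - y_c) - (x_t - x_c), sd_t, sd_c, n_t, n_c)

    significant_none = t_none >= t_critical
    significant_max = t_max >= t_critical
    if significant_none and significant_max:
        verdict = "significant without analysis of covariance"
    elif not significant_none and not significant_max:
        verdict = "non-significant without analysis of covariance"
    else:
        verdict = "run analysis of covariance"
    return verdict, t_none, t_max


if __name__ == "__main__":
    # Made-up values; 1.677 approximates the one-tailed .05 point for 48 df.
    print(ancova_quick_check(54, 50, 48, 50, 10, 10, 25, 25, t_critical=1.677))
```

In the made-up example the unadjusted test falls short of the critical
value while the maximally adjusted test exceeds it, so the full analysis
of covariance would have to be carried out.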



Step Six: Instructions for determining the level of statistical
significance for analysis of covariance are included in Appendix
B. However, there is no generally accepted criterion for deciding
whether the size of the gain is large enough to be considered
educationally significant. Where standardized tests are used, the
standard deviation of the national norm group ($\sigma$) provides a
useful reference. As a rule of thumb, the authors suggest that if
the observed posttest scores exceed the no-treatment expectation
by one-third of a standard deviation, the treatment effect be
considered educationally significant. In other words, if

$$\bar{Y}_t - \bar{Y}_c \geq \sigma/3$$

the gain may be considered educationally significant.

Model 3



Special Regression Models 

Summary 

General Characteristics. Two special regression models are
considered here, the Regression Projection Model (Tallmadge & Horst, 1974)
and the Regression-discontinuity Model (Campbell & Stanley, 1963). In
both models, the selection of treatment participants is determined on the
basis of performance on the pretest. All pupils in a group are pretested
and those who score above or below a particular score are assigned to the
treatment group while the remaining pupils serve as a comparison group.

Strengths. Both models make use of an identifiable and definable
comparison group. This group offers a sounder basis for establishing
no-treatment posttest expectations than national norms since the comparability
of the experiences of the two groups over the pre-to-posttest interval can
be empirically verified. The use of a sharp cutoff score in these models
simplifies the interpretation of significant results as compared to
regression models which do not require this type of assignment to groups.

Weaknesses. The Regression Projection Model tests the difference
between the observed and expected posttest means of the treatment group
where the "expectation" is derived from the comparison group regression
line. The validity of conclusions based on this model rests on the
assumption that the combined-group regression line would be linear over its entire
range under no-treatment conditions, an assumption which is not always
justified.

The Regression-discontinuity Model tests the difference between the
intercepts of the treatment and comparison groups' regression lines with
the line representing the pretest cutoff score. In its simplest form this
model involves the same assumption of linear regression as does the
Regression Projection Model, but by using higher-order regression equations
(curved regression lines) the problem can be eliminated (Sween, 1971). A
remaining weakness is that where treatment impact is inversely proportional
to pretest scores (i.e., the lowest scoring students make the biggest gains),
there may be no difference in regression line intercepts, even where the mean
gain of the treatment group is highly significant.

Implementation Considerations. Figure 2 on the following page
illustrates both the Regression Projection and the Regression-discontinuity
Models. In this idealized conception, the solid-line portion of the ellipse
to the right of the cutoff score represents the actual distribution (scatter
plot) of the pre- and posttest scores of the comparison group. It is used
to estimate what the score distribution for the treatment group would have
been if there had been no special treatment. This no-treatment expectation
is illustrated by the broken-line portion of the ellipse to the left of the
cutoff score. The actual distribution of the treatment group's scores is
illustrated by the solid-line portion of the ellipse to the left of the
cutoff score. This distribution is displaced upward above the no-treatment
expected scores, indicating that the treatment did have the effect of raising
posttest scores.

Regression lines are drawn diagonally through the distributions
shown in Figure 2. As mentioned above, the Regression Projection Model
involves testing the difference between the observed and the expected mean
posttest scores while the Regression-discontinuity Model involves testing
the difference between the intercepts of the regression lines with the cutoff
score. In the situation shown in Figure 2, regression is linear and the
amount of treatment impact was independent of the pupils' pretest scores.
Under these conditions, the difference between means is identical to the
difference between intercepts, and the two evaluation models should yield
identical results.

Figure 3 depicts a situation in which the treatment had its greatest
impact on pupils farthest below the cutoff score and a negligible effect on
pupils right at the pretest cutoff. Under these circumstances, the slope
of the treatment group regression line is flatter than that of the comparison
group. There is no difference between the intercepts of the regression lines
with the cutoff score, but there is a difference between the expected and
observed mean posttest scores of the treatment group. While this difference
would have been detected by the Regression Projection Model and not by the
Regression-discontinuity Model, other kinds of treatment impact variations
would be more readily detected by the Regression-discontinuity Model. Because
its assumptions are less subject to question, the latter model may also be
considered more conclusive, especially in its general form (Sween, 1971).
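
The contrast between the two models can be made concrete with a small
sketch. It fits ordinary straight regression lines and returns point
estimates only; no significance test is performed, and the function
names are hypothetical. The first function projects the comparison
group's line to the treatment group's pretest mean (Regression
Projection), and the second compares the heights of the two groups'
lines at the cutoff (the simplest, linear form of the
Regression-discontinuity Model). The data are made up to mimic the
situations described for Figures 2 and 3.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def projection_gain(x_t, y_t, x_c, y_c):
    """Observed minus expected treatment-group posttest mean.

    The expectation comes from the comparison group's regression line
    evaluated at the treatment group's pretest mean.
    """
    slope, intercept = fit_line(x_c, y_c)
    expected = intercept + slope * (sum(x_t) / len(x_t))
    observed = sum(y_t) / len(y_t)
    return observed - expected


def discontinuity_at_cutoff(x_t, y_t, x_c, y_c, cutoff):
    """Gap between the two groups' regression lines at the cutoff score."""
    slope_t, intercept_t = fit_line(x_t, y_t)
    slope_c, intercept_c = fit_line(x_c, y_c)
    return (intercept_t + slope_t * cutoff) - (intercept_c + slope_c * cutoff)


if __name__ == "__main__":
    cutoff = 45
    x_c, y_c = [50, 60, 70], [50, 60, 70]        # comparison group
    x_t = [20, 30, 40]                           # treatment group pretests

    uniform = [25, 35, 45]       # same 5-point gain for every pupil
    tapered = [32.5, 37.5, 42.5]  # largest gains for the lowest scorers

    for label, y_t in (("uniform gain", uniform), ("tapered gain", tapered)):
        print(label,
              projection_gain(x_t, y_t, x_c, y_c),
              discontinuity_at_cutoff(x_t, y_t, x_c, y_c, cutoff))
```

With the uniform gain both estimates agree. With the tapered gain the
projection estimate remains positive while the gap at the cutoff
vanishes, which is the pattern the text attributes to Figure 3.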

It is not possible to provide general decision criteria as to which of
the two models is more appropriate for a particular situation. Plotting the
scatter diagram, however, should provide some insight as to what kinds of
influences are operating and, consequently, which of the two models should be
used. Knowledge about the treatment may also help. If, for example, a
particular project provides remedial instruction in proportion to individual
students' needs, it would be more appropriate to expect the kind of impact
illustrated in Figure 3 than that shown in Figure 2. In this instance, the
Regression Projection Model would be the proper choice.

The utility of both special regression models is proportional to the
size of the correlation between pre- and posttest scores. As the size of the
correlation increases, the predicted treatment-group posttest score will
decrease, producing a corresponding increase in the difference between predicted
and observed scores. This relationship is analogous to the covariance
adjustment discussed on page 55.

Using test scores as the sole determinant of pupils' needs for special
instruction is a practice some educators consider unacceptable. This objection
can be resolved by using a composite measure made up, for example, of a pretest
score and an independently made, numerical, teacher rating of need. Of course,
as with test results, a precise cutoff score must be chosen.

One additional point is relevant. The entire discussion of the
Regression Projection Model has assumed that the comparison group regression equation
would be used to estimate how the treatment group would have performed had they
not received the treatment. It would appear equally possible to use the
treatment group regression line to estimate how the comparison group would have
performed if they had received the treatment. When the treatment affects the slope
of the regression line in the manner shown in Figure 3, however, this practice
would lead to the erroneous conclusion that the treatment had a negative impact.

Implementation Procedures.

Step One: To implement either special regression model,
administer and score the pretest. The test should be given to all
members of a group from which the treatment pupils are to be drawn
because of their special needs. The pretest should correlate
substantially with the posttest measure.

Step Two: If desired, generate a composite score which
incorporates the pretest measure and any other measure such as
independently made teacher ratings.

If a composite score is generated by simply summing the scores 
of the elements comprising it, the elements will automatically 
be weighted in proportion to the standard deviations of their 
distributions. This means that if, for example, we summed 
teacher ratings with a mean of 50 and a standard deviation of 
5 with test scores having a mean of 35 and a standard deviation 
of 10, the composite scores would reflect the test results twice 
as much as they would reflect the teacher ratings (the means of 
the element scores are irrelevant to their weights). If one 
wished to give equal weight to the two elements, it would be 
necessary to equalize their standard deviations prior to summing 
the scores. In the example given, multiplying each of the 
teacher ratings by two would raise the standard deviation from 
5 to 10, thus accomplishing the objective. 

If composite scores are used, it must be remembered that they
then become the pretest measure. All future calculations involving
"the pretest measure" must use the composite measure, not
one of its elements.
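
The weighting rule in Step Two can be checked numerically. The sketch
below is an illustration only: the function name and the sample scores
are hypothetical, and the standard deviation is computed with N in the
denominator, which is an assumption the guide does not state.

```python
from statistics import pstdev

def equal_weight_composite(ratings, tests):
    """Rescale the ratings to the SD of the test scores, then sum pairwise."""
    # In the guide's example this factor is 10 / 5 = 2.
    factor = pstdev(tests) / pstdev(ratings)
    return [factor * r + t for r, t in zip(ratings, tests)]

# Hypothetical data: ratings with mean 50 and SD 5, tests with mean 35 and SD 10.
ratings = [45, 55, 45, 55]
tests = [25, 25, 45, 45]
composite = equal_weight_composite(ratings, tests)
```

Because only the standard deviations enter the scaling factor, adding a
constant to either set of scores leaves the relative weights unchanged,
which is the point made above about the means being irrelevant.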

Step Three: Establish a single cutoff score. For a remedial
project, assign all pupils scoring below this value to the treatment
group. Alternatively, students scoring above a cutoff score
might be assigned to a special project for the gifted. One
convenient way to establish a cutoff score is to determine how many
pupils can be served by the special project and then to count up or
down from the lowest or the highest score until the quota is filled.
Once the cutoff is established, it must be adhered to strictly.
There can be no exceptions made in the assignment of each grade 
level if more than one grade level will be involved in the 
special project. 
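
The quota procedure in Step Three might be sketched as follows for a
remedial project. The function names are hypothetical, and the handling
of tied scores at the boundary is an assumption, since the guide does
not discuss ties.

```python
def remedial_cutoff(pretest_scores, quota):
    """Count up from the lowest score until the quota is filled.

    Returns the cutoff: pupils scoring BELOW it form the treatment group.
    Pupils tied at the boundary score all stay in the comparison group,
    so the treatment group can be smaller than the quota (assumption).
    """
    ordered = sorted(pretest_scores)
    if quota >= len(ordered):
        raise ValueError("quota must be smaller than the number of pupils tested")
    return ordered[quota]  # lowest score that is NOT served

def assign(pretest_scores, cutoff):
    """Apply the cutoff strictly, with no exceptions."""
    return ["treatment" if s < cutoff else "comparison" for s in pretest_scores]

scores = [12, 30, 18, 25, 18, 40]   # hypothetical (composite) pretest scores
cutoff = remedial_cutoff(scores, quota=3)
groups = assign(scores, cutoff)
```

For a project serving the highest scorers, the same idea applies with
the list sorted in descending order and the comparison reversed.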

Step Four: Administer and score the posttest. All available
pupils in the original group must be posttested even though only
a relatively small proportion of them may have participated in
the treatment. The subsequent analyses can be performed using
raw scores although it would be preferable to convert both pre-
and posttest scores to their standard or scale score equivalents
if standardized tests are used.

Step Five: To carry out the computations for either the Regression
Projection Model or the simplest version of the Regression-discontinuity
Model, calculate the following values:

                                              Treatment      Comparison
                                              Group          Group

Number of pupils                              $N_t$          $N_c$

Mean of (composite) pretest scores            $\bar{X}_t$    $\bar{X}_c$

Standard deviation of (composite)             $s_{X_t}$      $s_{X_c}$
pretest scores

Mean of posttest scores                       $\bar{Y}_t$    $\bar{Y}_c$

Standard deviation of posttest scores         $s_{Y_t}$      $s_{Y_c}$

Correlation of posttest with                  $r_t$          $r_c$
(composite) pretest

Slope of the regression line for              $b_t$          $b_c$
predicting Y from X
Programs are readily available for all computers and programmable
calculators to assist in these calculations. The names or
descriptions of appropriate programs usually specify that they
compute Pearson product-moment correlations and, in general,
all of the above values will be printed out automatically.

If no computational facilities are available, the calculations
may be done by hand. Computational formulas and instructions
may be found in any introductory statistics book. It will simplify
the task to recall that

$b = r \, \dfrac{s_Y}{s_X}$

Once the above values have been calculated, the remaining
computations are relatively simple.
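
As one way of obtaining the Step Five values for a single group, the
sketch below computes them directly from paired scores. The function
name and the dictionary keys are hypothetical. Standard deviations use
N in the denominator, which is an assumption; the correlation and the
slope do not depend on that choice.

```python
from math import sqrt

def summary_stats(x, y):
    """Step Five values for one group from paired pretest (x) and posttest (y) scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)          # sum of squares, pretest
    ss_y = sum((yi - mean_y) ** 2 for yi in y)          # sum of squares, posttest
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "s_x": sqrt(ss_x / n),
        "s_y": sqrt(ss_y / n),
        "r": sp_xy / sqrt(ss_x * ss_y),   # Pearson product-moment correlation
        "b": sp_xy / ss_x,                # slope for predicting Y from X
    }

stats = summary_stats([1, 2, 3, 4], [2, 4, 6, 8])   # hypothetical scores
```

The function would be called once for the treatment group and once for
the comparison group.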

Step Six A: Regression Projection Model.

In the Regression Projection Model, the actual mean posttest score
of the treatment group ($\bar{Y}_t$) is compared with an estimated
no-treatment value ($\hat{Y}_t$) obtained by projecting the comparison
group regression line.

This predicted value is calculated by the following formula:

$\hat{Y}_t = \bar{Y}_c + b_c(\bar{X}_t - \bar{X}_c)$.

The amount of the treatment effect is the difference between the
actual and the estimated mean scores, or:

$\bar{Y}_t - \hat{Y}_t$.

The statistical significance of this difference may be tested
using the formula given by Tallmadge & Horst (1974).
[The test statistic itself is not legible in this copy.]
The legible definitions that accompany it are:

$N = N_t + N_c$

$P_t = N_t / N$

$P_c = N_c / N$

$b = P_t b_t + P_c b_c$

[Two further definitions, each combining squared treatment group and
comparison group terms, are not legible in this copy.]
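
The projection in Step Six A can be sketched in a few lines. The
function names and the numbers are hypothetical; the arithmetic is the
projection of the comparison group regression line to the treatment
group's pretest mean, followed by the difference from the treatment
group's actual posttest mean. No significance test is included.

```python
def projected_no_treatment_mean(mean_y_c, b_c, mean_x_t, mean_x_c):
    """Comparison group regression line evaluated at the treatment pretest mean."""
    return mean_y_c + b_c * (mean_x_t - mean_x_c)

def treatment_effect(mean_y_t, mean_y_c, b_c, mean_x_t, mean_x_c):
    """Actual treatment group posttest mean minus the projected no-treatment mean."""
    return mean_y_t - projected_no_treatment_mean(mean_y_c, b_c, mean_x_t, mean_x_c)

# Hypothetical summary values: the treatment group scored lower on the pretest.
expected = projected_no_treatment_mean(mean_y_c=60.0, b_c=0.8, mean_x_t=30.0, mean_x_c=50.0)
effect = treatment_effect(mean_y_t=48.0, mean_y_c=60.0, b_c=0.8, mean_x_t=30.0, mean_x_c=50.0)
```

With these inputs the projection lies 16 points below the comparison
group posttest mean, because the treatment group's pretest mean is 20
points lower and the slope is 0.8.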

Step Six B: Regression-discontinuity Model

The simplest form of the Regression-discontinuity Model consists of
fitting straight regression lines independently to the treatment and
comparison groups, then testing the difference between the two lines
at the point where they intersect the pretest cutoff score value.

Let $K$ = the (composite) pretest cutoff score

$\hat{Y}_t$ = the Y value of the treatment group
regression line for a (composite)
pretest score of K

$\hat{Y}_c$ = the Y value of the comparison group
regression line for a (composite)
pretest score of K

then

$\hat{Y}_t = \bar{Y}_t + b_t(K - \bar{X}_t)$

$\hat{Y}_c = \bar{Y}_c + b_c(K - \bar{X}_c)$

Unlike the Regression Projection Model in which a treatment effect was
calculated, there is no special interpretation of the value
$\hat{Y}_t - \hat{Y}_c$ unless the regression lines have equal slopes. In
this case it is a treatment effect. However, if this value is
significantly greater than zero, it is evidence of a real treatment
effect. The statistical significance of the difference may be tested
using the following formula (Sween, 1971):



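$t_{N_t + N_c - 4} = \dfrac{(\hat{Y}_t - \hat{Y}_c)\sqrt{N_t + N_c - 4}}{\sqrt{(V_t + V_c)\left[\dfrac{1}{N_t} + \dfrac{1}{N_c} + \dfrac{(K - \bar{X}_t)^2}{N_t s_{X_t}^2} + \dfrac{(K - \bar{X}_c)^2}{N_c s_{X_c}^2}\right]}}$

where

$V_t = N_t s_{Y_t}^2 (1 - r_t^2)$

$V_c = N_c s_{Y_c}^2 (1 - r_c^2)$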

The one-tailed probability of the computed $t$ value can be found in the
tables provided in most standard statistical texts. The subscript attached
to $t$ in each formula is the appropriate number to use for the "degrees
of freedom" column in the table. If the one-tailed probability is less than
or equal to .05 ($p \le .05$), the special project can be said to have produced
statistically significant achievement gains.
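
A sketch of the regression-discontinuity comparison follows. The
function names, dictionary keys, and numbers are hypothetical. The
standard error used here is the usual textbook one for the difference
between two independently fitted regression lines with a pooled
residual variance; treating that as the form of the test cited above is
an assumption of this sketch, as is the use of standard deviations
computed with N in the denominator.

```python
from math import sqrt

def line_value_at(k, mean_x, mean_y, b):
    """Y value of a group's regression line at pretest score k."""
    return mean_y + b * (k - mean_x)

def discontinuity_t(k, t, c):
    """Difference between the two lines at the cutoff k, its t value, and df.

    t and c are dicts with keys n, mean_x, mean_y, s_x, s_y, r, b.
    """
    diff = (line_value_at(k, t["mean_x"], t["mean_y"], t["b"])
            - line_value_at(k, c["mean_x"], c["mean_y"], c["b"]))
    # Residual sums of squares about each group's own regression line.
    v_t = t["n"] * t["s_y"] ** 2 * (1 - t["r"] ** 2)
    v_c = c["n"] * c["s_y"] ** 2 * (1 - c["r"] ** 2)
    df = t["n"] + c["n"] - 4
    spread = (1 / t["n"] + 1 / c["n"]
              + (k - t["mean_x"]) ** 2 / (t["n"] * t["s_x"] ** 2)
              + (k - c["mean_x"]) ** 2 / (c["n"] * c["s_x"] ** 2))
    t_value = diff * sqrt(df) / sqrt((v_t + v_c) * spread)
    return diff, t_value, df

# Hypothetical summary values for a remedial project with cutoff K = 40.
treat = {"n": 10, "mean_x": 30.0, "mean_y": 45.0, "s_x": 5.0, "s_y": 6.0, "r": 0.8, "b": 0.96}
comp = {"n": 40, "mean_x": 55.0, "mean_y": 60.0, "s_x": 8.0, "s_y": 8.0, "r": 0.75, "b": 0.75}
diff, t_value, df = discontinuity_t(40.0, treat, comp)
```

The returned t value would then be looked up in a one-tailed table with
the returned degrees of freedom.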

There is no generally accepted criterion for deciding whether the
size of the gain is large enough to be considered educationally significant.
Where standardized tests are used, the standard deviation of the national
norm group ($\sigma$) provides a useful reference for the Regression Projection
Model. As a rule of thumb, the authors suggest that if the observed posttest
scores exceed the no-treatment expectation by one-third of a standard
deviation, the treatment effect be considered educationally significant.
In other words, if

$\bar{Y}_t - \hat{Y}_t > \sigma/3$

the gain may be considered educationally significant.
Model 4



General Regression Model 

Summary 

General Characteristics. This model may be thought of as a more
generalized form of the Analysis of Covariance Model. Posttest differences
between any two (or more) groups can be tested, with adjustments for the
effects of any number of quantifiable variables such as pretest scores, sex,
SES, location, etc., and their interactions. The effects of using curved
regression lines can also be tested and removed.

Strengths. The model itself places no restrictions on the selection
of students, their relative pretest performance, or on any facet of the
experimental design. Where other models implicitly assume that posttest
results are not related to variables other than the pretest, the General
Regression Model permits systematic tests of this assumption.

Weaknesses. All forms of the General Regression Model, including
the special case of analysis of covariance, test the hypothesis that
posttest differences are the effect of random fluctuations. Where treatment
and comparison groups were clearly different to begin with, this is not
a useful hypothesis to test (Lord, 1967, p. 38). Regression models are
frequently used to statistically "equate" groups which actually differ
systematically, but in such cases, regression models systematically
underadjust (Campbell & Erlebacher, 1970). It should be noted that this
underadjustment is minor where the correlations between the pretest and the
posttest are high. Where the treatment group has a lower mean pretest score
than the comparison group and the correlation is not high, the model will
overestimate expected treatment group results and thereby underestimate the
degree of project impact.

Implementation Considerations. While the flexibility of this
model may permit an adequate evaluation where none of the other models is
feasible, the complexity of both the multivariate statistical manipulations
and the experimental design issues creates major obstacles to
implementation. Only the most sophisticated specialists in these areas
should attempt to plan and implement a study of this nature.

Implementation Procedures. An ad hoc design and detailed procedures
must be developed for each evaluation by a qualified specialist. A
complete, highly technical, mathematical development of the model is
available (Horst, 1974).







Model 5 



Norm-referenced Model 

Summary 

General Characteristics. Project children are compared to a norm
group usually comprised of a nationally representative sample of children
at the same grade level. The no-treatment expectation is that the project
pupils will maintain, at posttesting, the same achievement status with
respect to the norm group as they had at pretesting. If their posttest
status is higher, the assumption is made that the improvement resulted
from participation in the special project.

Strengths. Where no comparison group is available, the norm group
provides a plausible estimate of no-treatment posttest scores. Even where
a comparison group is available, unless it comes from the same population
as the treatment group, the Norm-referenced Model offers a more defensible
estimate of posttest performance at substantially less cost and effort
than a comparison-group design.

Weaknesses . The validity of the model rests on the assumption that 
the achievement status of a particular subgroup remains constant relative 
to the norm group over the pre- to posttest interval if no special treat- 
ment is provided. Empirical support for this assumption is minimal. It 
is conceivable that some subgroups would move up and others move down in 
the normal course of events. When the norm group is like the treatment 
group, the plausibility of the underlying assumption is greatly enhanced; 
thus, for example, norms for gifted children would be best for assessing 
a project serving such pupils. 

Implementation Considerations . This model is widely applicable as 
it does not require a comparison group. The model requires the use of 
standardized tests. The same level of the same test should be used for 
both pre- and posttesting (see Hazard 11). Program participants may not 
be chosen on the basis of their pretest scores (see Hazard 7) . Both pre- 
and posttesting must be accomplished on dates corresponding to the ones on 
which the test publisher collected normative data (see Hazard 3). 
Implementation Procedures.

Step One: Select a suitable standardized test which has real 
(not projected) normative data points at dates which are suitable 
for pre- and posttesting. Information about the normative data 
points for some of the most commonly used instruments is presented 
in Appendix A. Similar information about other tests can be derived
from the Technical Manuals provided by the test publishers. 

It can be seen from Appendix A that most tests have only a single 
data point either in fall, winter, or spring. Use of these tests 
requires then, a 12-month pre- to posttest interval. If high 
student turnover is expected, it might be better to choose a test 
for which normative data have been collected in both fall and spring 
even though the choice of tests is then quite limited. 

Step Two: Administer and score the test in exact compliance with 
the procedures specified by the test publisher. Each test score 
should be converted to standard or scale scores. If the tests 
are scored by a scoring service, be sure to specify that each raw 
score should be converted to its standard or scale score equivalent. 

Step Three: Compute the means and standard deviations of the pre- 
and posttest distributions of the standard or scale scores if 
these are not provided by a test scoring service. Also compute 
the correlation between pre- and posttest scores. Computational 
formulas for these "summary statistics" can be found in any elementary
statistics book. It is necessary, of course, to do separate
computations for each grade level participating in the project.

Step Four: Look up the percentile equivalents of the mean pretest
and posttest standard or scale scores in the norm tables
corresponding to the pre- and posttest administration times. The
pretest percentile score is used to derive the no-treatment posttest
expectation. In the absence of a special treatment, it would
be expected that a group of pupils would maintain its standing
relative to the norm group. Thus, the expected posttest mean
score can be found by looking up the standard or scale score
equivalent of the pretest percentile in the posttest norms table.
This score constitutes the expected no-treatment posttest mean.
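
The two table lookups in Step Four might be sketched as follows. The
norms tables below are invented for illustration and do not come from
any published test, and the linear interpolation between table entries
is an assumption; a publisher's tables would normally be read directly.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear lookup of y at x; xs must be in increasing order."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("value outside the range of the norms table")
    for i in range(1, len(xs)):
        if x <= xs[i]:
            frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])

# Hypothetical norms tables: (scale score, percentile) pairs.
FALL_NORMS = [(300, 10), (320, 25), (340, 50), (360, 75)]
SPRING_NORMS = [(320, 10), (345, 25), (365, 50), (385, 75)]

def expected_posttest_mean(pretest_mean, pre_norms, post_norms):
    """Scale score that holds the pretest percentile standing at posttest time."""
    pre_scores, pre_pcts = zip(*pre_norms)
    post_scores, post_pcts = zip(*post_norms)
    percentile = interpolate(pretest_mean, pre_scores, pre_pcts)
    return interpolate(percentile, post_pcts, post_scores)

expected = expected_posttest_mean(330, FALL_NORMS, SPRING_NORMS)
```

In this invented example a pretest mean of 330 sits midway between the
25th and 50th percentile entries of the fall table, so the expected
posttest mean is the midpoint of the corresponding spring entries.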

Step Five: Examine the obtained posttest mean in relation to
the expected one. If the obtained or observed mean is larger than
the expected one, there may be some reason to believe that the
project was effective. The statistical significance of the
difference should be checked with a $t$ test whose numerator is the
difference between the observed and the expected mean posttest
scores. [The remainder of the formula is not legible in this copy.]
It is computed from the following quantities:

observed mean posttest score
expected mean posttest score
pretest standard deviation
posttest standard deviation
correlation between pre- and posttest scores
number of children
degrees of freedom

The one-tailed probability of the computed $t$ can be found in the
tables provided in most standard statistical texts. If it is less
than or equal to .05 ($p \le .05$), the special project may be said
to have produced statistically significant achievement gains.

There is no generally accepted criterion for deciding whether
the size of the gain is large enough to be considered educationally
significant. Since standardized tests are used, the standard
deviation of the national norm group ($\sigma$) provides a useful
reference. As a rule of thumb, the authors suggest that if
the observed posttest scores exceed the no-treatment expectation
by one-third of a standard deviation, the treatment
effect be considered educationally significant. In other
words, if

$\bar{Y}_{\text{observed}} - \bar{Y}_{\text{expected}} > \sigma/3$

the gain may be considered educationally significant.
V. GETTING THE DATA (TESTING AND RECORDING)



Once an evaluation design and an appropriate achievement test are
chosen, the most crucial step in the evaluation process is the collection
of accurate, complete data. Analysis of the data may be a more technically
complex step, but, at least, when analysis errors are discovered they can
usually be corrected. On the other hand, if data are distorted or missing,
no amount of analysis can adequately correct the problem. If there are
too many flaws in the raw data, the entire evaluation becomes meaningless.

There are four steps in obtaining test data, each requiring planning
and decisions: (a) assembling the students, (b) administering the tests,
(c) scoring the tests, and (d) recording the scores.

Step 1: Assembling Students for Testing

This step, often passed over lightly, is an important consideration
for two reasons. First, of course, the time of day and the place where
students are assembled may affect test scores. The date of testing may
also be important (see the Norm-referenced Model, page 72). Second,
unless the problems are carefully thought out ahead of time, procedures
used for pretesting students may prove so cumbersome that changes are
made for the posttest. A change such as testing students in their
classrooms rather than in a large assembly hall may or may not make a big
difference in scores, but it is certainly unsafe to assume that it will
not. Having to abandon half of a carefully selected control group because
posttesting is too expensive is also clearly undesirable. Careful
planning should avoid all such problems.

It is difficult to generalize about rules for assembling students
because of the wide differences among schools. Most important is to
minimize the disruption to the students while insuring that all treatment and
comparison students can take both pre- and posttests under similar
testing conditions. The major problems in achieving this goal are high
absentee rates and distribution of students across a large number of
schools. Where the evaluation simply involves testing project students
in their regular project setting, few problems should be encountered. If,
on the other hand, control students are involved, or if students are to
be tested before the project begins or after it ends, then it is well
worth the effort to lay out in detail the number of different tests or
test levels to be used, the number of test locations, the time for each
test, the number of make-up sessions, the number of special test
administrators or supervisors, and so on. Testing often turns out to be a
bigger project than anticipated, and, if reduction of effort is necessary,
it is better to simplify both the pretest and posttest proportionally
rather than expending too much effort on the pretest, and then being unable
to complete the posttest.

Step 2: Administering the Tests 

It goes without saying that test administration should be orderly,
and that cheating and other irregularities are not permissible. But
orderliness is not enough. For the purposes of evaluation it is necessary to
have consistency. There are two kinds of consistency to worry about, depending
on whether a norm-referenced or comparison-group evaluation design is used.
If a norm-referenced design is used, the critical thing is to be sure
that the test publisher's procedures are followed exactly. This
specifically includes reading instructions, answering questions, doing practice
problems, and timing each section.

When a comparison group is used, it is still advisable to follow the
publisher's instructions to the letter so as to make norm-referenced
comparison possible, but the most critical thing becomes the similarity between
treatment and comparison group testing situations. The most straightforward
way of insuring comparable situations is to test both treatment and comparison
students as a single group,¹ but, usually, in either norm-referenced
or comparison-group designs it will be necessary to test several groups,
and special steps must be taken to make sure that they are tested under
similar conditions so that their scores can be compared.

1. However, bringing comparison group pupils into an unfamiliar project lab
for testing may put them at a disadvantage.
There are basically two ways of making test situations comparable.
One is to use a few carefully trained administrators to test all the groups.
The other is to carefully train the regular teachers to give the tests
to their own students. At best, the latter alternative is much less
desirable from a research viewpoint, and some monitoring of the testing
procedures is advisable. If teachers must be used, it is advisable to
have them test each other's classes to minimize possible biases.

Simply telling teachers to look over the test manual is never
adequate if one is serious about the evaluation. Each test administrator
should be impressed by the importance of following procedures exactly,
and each one should have at least "walked through" the entire process,
from handing out pencils to collecting the tests, before ever
administering the test in an evaluation. Where teacher judgments are involved
in scoring student responses (as in oral reading tests), substantially more
training is required.

Step 3: Scoring the Tests

Scoring of standardized tests is usually separate from test
administration, so it becomes the third step in the data gathering process.
Obviously, the most important requirement in scoring is accuracy, but
there are trade-offs of time and money to consider. The major variables
are who does the scoring and what type of answer form to use. Most of
the major tests can be purchased with machine-scorable booklets or
separate answer sheets. Some non-standardized tests may be available only
in hand-scored versions.

The main factor in choosing among answer forms is the age of the
students. Separate answer sheets are usually much easier to process,
but young children tend to score lower on these forms, presumably because
the forms are confusing to them. In general, separate answer sheets are
suitable for above average fourth graders and all older students. Younger
children should use machine-scorable or hand-scored booklets (Harcourt
Brace Jovanovich, Inc., 1973).

Whichever type of form is used, there are three basic ways of having
the test scored. Scoring can be done by: (a) local school personnel,
(b) the publisher of the test, or (c) an independent test scoring company.
A choice between the test publisher or an independent company will depend
on a variety of variables specific to the local situation and the test that
is chosen. Cost, turnaround time, and quality of service may vary, as well
as the services offered, and some shopping is in order. The major decision,
however, is whether to have the scoring done by either type of service or
simply to have the scoring done by available school personnel. Obviously
there is no general answer that will apply to all situations. The major
advantages of a good scoring service are the accuracy and the variety of
analyses provided by computer processing. The major disadvantages are the
cost, the care necessary in preparing the answer forms, and the turnaround
time. There are also the possibilities that forms will be lost in shipping,
or that mishandling or faulty equipment will result in scoring errors.
There is little recourse when forms are lost, but spot checks on scoring
accuracy should be made after answer forms have been returned.

"Ballpark" cost figures for machine-scored forms (taken from one 
widely-used publisher's service) range from $.30 to $.70 per pupil depending
on the type of form and length of the test battery. Hand-scored booklets
cost three or four times as much to score, although a lower original
purchase price will offset this difference slightly. Clearly, local personnel
can do the basic scoring at lower cost, but included in this publisher's
price are a number of features and services that are costly and
time-consuming when scoring is done by hand. These include: (a) reports with
convenient formats in triplicate for each group (class), completely
identified as to test, date, group, etc.; (b) raw scores, percentile scores
(local or national distributions), and standard scores for each student on
each subtest; and (c) mean raw scores for each group. Several other
analyses are available for prices ranging from an additional $.05 to $.12 per
student for each analysis. These include score distributions for each class,
item analyses, and individual student profiles. Additional statistical
analyses are readily available, or, for schools with access to their own
computer facilities, the scores are available from the publisher on
computer cards or tape.
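
The per-pupil figures quoted above can be turned into a rough budget
range. The sketch below is only arithmetic on those figures; the
function name, the pupil count, and the number of extra analyses are
hypothetical, and the prices are the guide's own "ballpark" values
rather than current ones. Amounts are kept in whole cents to avoid
rounding error.

```python
def scoring_cost_cents(n_pupils, base_cents, n_extra_analyses=0, extra_cents=0):
    """Total machine-scoring cost in cents for one set of pupils."""
    return n_pupils * (base_cents + n_extra_analyses * extra_cents)

n = 500  # hypothetical number of pupils tested

low = scoring_cost_cents(n, 30)            # cheapest form, no extra analyses
high = scoring_cost_cents(n, 70, 2, 12)    # costliest form, two extra analyses

# Hand scoring is described as three or four times the basic machine price.
hand_low = 3 * scoring_cost_cents(n, 30)
hand_high = 4 * scoring_cost_cents(n, 70)

print(f"machine scoring: ${low / 100:.2f} to ${high / 100:.2f}")
print(f"hand scoring:    ${hand_low / 100:.2f} to ${hand_high / 100:.2f}")
```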
In short, for very small tryouts with simple analyses it may be
desirable to do the entire job locally. Unless local computer facilities
are available, however, more extensive evaluations may well be completed
more accurately, thoroughly, and economically with the help of a scoring
service. All the major services have literature and consultants to
provide details and to assist in planning the scoring and analysis.

Step 4: Recording the Scores

Recording the scores is the final step in the data collection
process but, to ensure that the scores will be usable, the details of
recording should be worked out well before pretest time. Where a
commercial scoring service is used, the school evaluator may have little
control over the recording process, but if the school elects to do its
own scoring or wishes to transfer scores from computer printouts to a
more convenient form, the evaluator must consider two important issues:
the accuracy of the data, and the details of the data recording forms.

Copying scores accurately onto data forms is not a complicated 
problem for small-scale local studies, but it must not be overlooked. 
Even the most conscientious recorders make errors, and all data forms 
should be carefully proofread, preferably with one person reading aloud 
while a second person checks the scores.

The details of the data forms might appear to be of little
importance, but in many school districts the way in which data have been
recorded virtually precludes any reasonable analyses. It is not possible
to prescribe a standard data format because school requirements vary so
widely, but it is possible to state two general principles which must be
observed. First, all scores must be completely identified, and second,
scores must be arranged in a way that facilitates analysis. A sample
data form illustrating these principles is shown in Figure 4. Specific
issues related to the use of such a form are discussed below.
Considerations for data recording forms



1. Most sets of scores require more than one page. The page number
identifies each sheet and the "number of pages" helps make sure
no pages are missing.

2. Every sheet of paper should have a name and date to indicate 
who filled in the numbers in case any questions arise in the 
future. 

3. The group for which data are recorded should be clearly
identified at the top of the page to simplify the retrieval of that
group's data from a large data base.

4. The page should be arranged so that it can be photocopied with- 
out the students' names. This permits wide use of the data for 
research purposes without compromising student privacy. 

5. It simplifies analysis greatly to have only one test (pre and
post) recorded on each sheet, provided the rules for listing
students (see points 6-11 below) are followed. The complete
name of the pretest and posttest (taken exactly from the test
booklets and including publication date) must be listed. This
point is widely neglected.

6. Identifying students and organizing their names efficiently are
the most difficult problems in recording student data. Where
evaluations are only for one year and are based on fall and
spring testing, the problems can be solved with a little effort
and care. But where students must be followed over several
years, there is no simple solution since students come and go
from projects, and groups are reorganized every year. The
simplest rule is to make sure that the posttest scores are all
entered on the same sheet of paper as the corresponding pretest
scores. This at least eliminates the problem of the evaluator
trying to find each student name on two lists.

7. A second rule for listing student names is to establish a
standard ordering of the names, and stick to it for the life
of the evaluation and for all tests that are used. If a student
moves or fails to take some of the tests, then the appropriate
entries are blank, of course, but he should not be eliminated
from the list. If new students enter the program, their names
should be added to the end of the lists for all tests, even
those for which no data will be entered. In addition to the
obvious reduction in confusion, there are some practical
advantages to this procedure. For example, a master form can be
prepared with only the students' names and identification
numbers filled in, and the forms can simply be duplicated when
new tests are given. It also makes comparisons or correlations
between any two sets of scores relatively easy because any two
forms can be laid side by side and the corresponding names will
line up correctly. If there is a compelling reason to change
the order of student names in the middle of a project, then
either all forms should be changed, or a double set of forms
(old and new order) should be maintained.

8. A rule should be established for recording names. "Caldwell,
D.E." should never become "Danny Caldwell" on a second list.
The simplest procedure is to allow plenty of space and to spell
out first names and middle initials (e.g., Caldwell, Daniel E.).

9. Each student should have an ID number that completely identifies
him. The example in Figure 4 uses a one-digit experimental
condition number, a two-digit group or class identification, a
one-digit sex code, and a two-digit student number. In some
evaluations, other codes (including letters) can be used, but careful
consideration of the situation is necessary in order to permit
any desired grouping simply by ID number.

10. A page should have some reasonable number of entries, probably
20 or 25. For some inexplicable reason, numbers like 27 and 33
are popular, and often the number of entries varies from page to
page. Unnecessary complications like this help to make the
statistician's life miserable.
11. Test dates are critical, especially in norm-referenced
evaluations. If all students listed on a page have their pretests in
one day and all are later posttested in a single day, then the
test date column is not really necessary. However, this is
usually impossible to predict at the time the form is made up,
so the columns should be there in order to permit identification
of make-up tests and late entries into the program.

12. Pre- and posttest scores should, in general, be in adjacent
columns, rather than pairing each pretest raw score with its
standard score, percentile score, etc., followed by each posttest
score and its transformations. This greatly simplifies
the mechanics of analysis; comparisons are nearly always made
between pre- and posttest scores of the same type.
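
The ID number scheme described in the list above lends itself to a
small sketch. The function names are hypothetical, and the width of the
student number field is left as a parameter because it is an assumption
here; the other field widths follow the description of Figure 4.

```python
def make_student_id(condition, group, sex, student, student_digits=2):
    """Build an ID: 1-digit condition, 2-digit group, 1-digit sex code, student number."""
    if not (0 <= condition <= 9 and 0 <= group <= 99 and 0 <= sex <= 9):
        raise ValueError("a field does not fit its column width")
    if not 0 <= student < 10 ** student_digits:
        raise ValueError("student number does not fit its column width")
    return f"{condition:01d}{group:02d}{sex:01d}{student:0{student_digits}d}"

def parse_student_id(student_id):
    """Split an ID string back into its numeric fields."""
    return {
        "condition": int(student_id[0]),
        "group": int(student_id[1:3]),
        "sex": int(student_id[3]),
        "student": int(student_id[4:]),
    }

def select(ids, **wanted):
    """Return the IDs whose fields match, e.g. select(ids, condition=1, group=7)."""
    return [i for i in ids
            if all(parse_student_id(i)[k] == v for k, v in wanted.items())]

ids = [make_student_id(1, 7, 2, 34),
       make_student_id(2, 7, 1, 5),
       make_student_id(1, 12, 1, 9)]
treatment_ids = select(ids, condition=1)
```

Because every field has a fixed width, any desired grouping can be
obtained from the ID alone, which is the purpose of the scheme.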
VI. ANALYZING THE DATA AND REPORTING THE RESULTS



Analysis 

Basic decisions relating to data analysis should be a part of the
original evaluation planning. The major decision is the selection of a
suitable evaluation model, treated in Chapters III and IV. A second
consideration which should be settled at the same time is the division of
the students into analysis subgroups. Because of the advantages of having
large numbers of students in an analysis, there is some temptation to
analyze all available treatment students as a single group, and, where
comparison students are used, to combine all of them into a second group.
This practice is not justified when distinct subgroups of students are
represented. In particular, it is almost never advisable to combine data
from (a) different treatment conditions, (b) different grade levels, or
(c) different tests. In most education projects it is more meaningful to
analyze each subgroup separately, draw separate conclusions for each
subgroup, and then summarize the results of these individual analyses. Unless
adequate thought is given to the analysis subgroups in the initial planning
stages, the subgroups may be too small or too heterogeneous to permit any
convincing conclusions.

When the analysis subgroups are determined and the data are in hand,
the analysis can proceed. The essential steps for implementing each of
the five evaluation models are treated in Chapter IV, but the following
preliminary analysis and screening procedures should substantially
facilitate interpretation of the findings.

A. For students with both pre- and posttest scores: 

(1) Plot the distribution of the pretest raw scores, and 
compute the mean and standard deviation. 

(2) Plot the distribution of the posttest raw scores, and 
compute the mean and standard deviation. 

(3) Plot the joint pretest-posttest distribution, and compute 
the product-moment correlation. 

B. For students with pretest scores only: 

Plot the distribution of the pretest raw scores, and 
compute the mean and standard deviation. 

C. For students with posttest scores only: 

These scores are usually not interpretable by themselves, 
but may be saved for student files or used as baseline 
data for following-year evaluations. 
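
The numerical part of steps A through C can be sketched in a few lines of Python. This is only an illustration, not a procedure prescribed by the guidebook: the function names, the use of None to mark a missing score, and the choice of the sample standard deviation (n - 1 in the denominator) are all assumptions made for the example, and the plots themselves are left to whatever graphing tool is at hand.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    # Sample standard deviation (n - 1 denominator); this choice is an
    # assumption of the sketch. Requires at least two scores.
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def pearson_r(xs, ys):
    # Product-moment correlation between paired scores.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def preliminary_summary(records):
    """records: list of (pretest, posttest) raw scores; None marks a missing score."""
    both = [(p, q) for p, q in records if p is not None and q is not None]
    pre_only = [p for p, q in records if p is not None and q is None]
    post_only = [q for p, q in records if p is None and q is not None]
    pre = [p for p, _ in both]
    post = [q for _, q in both]
    return {
        # Group A: students with both scores.
        "both": {
            "n": len(both),
            "pre_mean": mean(pre), "pre_sd": std_dev(pre),
            "post_mean": mean(post), "post_sd": std_dev(post),
            "r": pearson_r(pre, post),
        },
        # Group B: students with pretest scores only.
        "pre_only": {"n": len(pre_only), "mean": mean(pre_only),
                     "sd": std_dev(pre_only)},
        # Group C: posttest-only scores are counted and set aside.
        "post_only": {"n": len(post_only)},
    }
```

Keeping the three groups separate in the output makes it easy to report N alongside every mean and standard deviation.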

In general, the size of any achievement gains will be apparent from 
the above analyses. The differences in mean scores which are tested 
statistically in the various models can be inspected graphically by comparing the 
appropriate distributions. However, an equally important use of the plotted 
distributions is to permit inspection of the data for irregularities which 
may influence the interpretation of results. It is not possible to list all 
the kinds of irregularities that might be encountered, but the following 
occur frequently and are important: 

Floor or ceiling effects: Pretest and posttest distributions should be 
inspected to see whether they are bunched near the top or the bottom of the 
score range. The top of the score range is simply the highest possible raw 
score. The bottom of the score range may be zero, but for multiple choice 
tests it is usually taken to be the score that would be expected if students 
were simply guessing. For example, in a typical four-choice test students 
could be expected to get about one fourth of the items correct by guessing. 
The impacts of floor and ceiling effects are discussed in Hazard 4, page 15. 
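
As a rough numerical companion to the visual inspection, the chance-level floor and the share of scores bunched near either end can be computed as below. The function names and the 10 percent band used to define "near" are hypothetical choices for this sketch; the guidebook gives no numerical cutoff.

```python
def chance_score(n_items, n_choices):
    # Expected raw score from pure guessing, e.g. 40 items with
    # 4 choices each gives an expected score of 10.
    return n_items / n_choices

def floor_ceiling_check(scores, n_items, n_choices, margin=0.10):
    """Proportion of scores within `margin * n_items` points of the
    chance-level floor or of the maximum possible score.
    The margin is an illustrative assumption, not a published rule."""
    floor = chance_score(n_items, n_choices)
    ceiling = n_items
    band = margin * n_items
    n = len(scores)
    near_floor = sum(1 for s in scores if s <= floor + band)
    near_ceiling = sum(1 for s in scores if s >= ceiling - band)
    return {
        "floor": floor,
        "ceiling": ceiling,
        "prop_near_floor": near_floor / n,
        "prop_near_ceiling": near_ceiling / n,
    }
```

A large proportion at either end is a signal to look at the plotted distribution, not a verdict in itself.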

Large changes in standard deviations from pretest to posttest: A large 
increase in standard deviation may simply indicate that there were problems 
with posttesting but it may also indicate that the project is spreading the 
students out by helping the initially better students relatively more than 
the others. A decrease may indicate that initially low scoring students are 
helped relatively more. Either effect would be an important finding and 
should be described in any evaluation report on the project. 

Low correlations between pre- and posttest scores or irregular 
joint distributions: These symptoms can be the result of a variety of 
problems but, typically, they indicate that the tests are not measuring 
the attribute of interest with sufficient reliability. If the skill is 
not measured reliably then, clearly, improvements will not be adequately 
measured, and positive project results may be obscured. With standardized 
tests, correlations of .80 to .90 are possible. As correlations drop, 
results become correspondingly more ambiguous. 

Differences between pretested students who took the posttest and 
those who didn't: If students who have only pretest scores appear to be 
much different on the pretest from those who took both pre- and posttests, 
some investigation is required. There are many possible explanations. 
The better students may graduate, or poorer students may drop out, or both. 
Such findings are themselves important, and may also be relevant to the 
interpretation of posttest distributions. If the better students are 
missing from the posttest distribution, the mean score will be depressed. 
If the poorer students are missing, the mean score will be spuriously 
inflated. 
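
One simple way to put a number on this comparison is to express the pretest difference between the two groups in standard deviation units. The sketch below does that with a pooled standard deviation; the function name and the use of a standardized difference are assumptions of this example, not a method taken from the guidebook.

```python
import math

def pretest_gap(pre_completers, pre_dropouts):
    """Compare pretest scores of students who took both tests with those
    who took the pretest only. Returns (raw difference, difference in
    pooled standard deviation units); negative values mean the students
    missing from the posttest scored lower on the pretest."""
    def mean(xs):
        return sum(xs) / len(xs)

    def sum_sq(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs)

    diff = mean(pre_dropouts) - mean(pre_completers)
    pooled_var = (sum_sq(pre_completers) + sum_sq(pre_dropouts)) / (
        len(pre_completers) + len(pre_dropouts) - 2
    )
    return diff, diff / math.sqrt(pooled_var)
```

A clearly negative gap suggests the posttest mean may be spuriously inflated by the loss of poorer students; a clearly positive gap suggests it may be depressed.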

Once the data have been carefully examined, the statistical tests 
of the appropriate model may be applied. In most cases the results will 
have been clear from inspection of the distributions, and a test of 
significance will serve mainly as a concise, easily reported confirmation that 
differences were or were not likely to be due to chance factors. It must 
be remembered that statistical significance depends, in practice, on the 
number of students in the distributions. Even trivial differences in mean 
scores become statistically significant when hundreds of students are 
involved. Conversely, almost any educationally important effect will prove 
to be statistically significant, even with as few as 25 or 30 students. 
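
The dependence of significance on group size can be illustrated with the ordinary t statistic for two independent groups of equal size. This is a simplified stand-in chosen for the example, not one of the guidebook's five models, and the score values are invented; the function name is hypothetical.

```python
import math

def t_statistic(mean_difference, sd, n_per_group):
    """t for two independent groups of equal size that share a common
    standard deviation. The standard error of the difference between
    two means is sd * sqrt(2 / n)."""
    return mean_difference / (sd * math.sqrt(2.0 / n_per_group))

# With the standard deviation fixed at 10 raw-score points, the critical
# value at the .05 level (two-tailed) is roughly 2.0 for group sizes
# like these.
small_effect_small_n = t_statistic(2.0, 10.0, 30)    # about 0.77
small_effect_large_n = t_statistic(2.0, 10.0, 500)   # about 3.16
large_effect_small_n = t_statistic(8.0, 10.0, 25)    # about 2.83
```

The same two-point difference fails to reach significance with 30 students per group and passes easily with 500, while an eight-point difference is significant with only 25 per group.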

The question of how big a gain must be before it is considered 
educationally important is, of course, a judgmental question rather than a 
statistical one. The evaluator or the project director may well be called 
upon to offer an opinion on this issue, and while no specific guidelines 
could cover the variety of settings and situations for educational projects, 
the above comments suggest three issues that must be clearly separated in 
drawing conclusions about the educational importance of project effects. 
One issue is the size of the project effects. A second is the cost 
associated with implementing the project, and the third is the 
conclusiveness of the evaluation results. 

The importance of a given project effect usually depends on the cost 
of the project and the available alternatives. That is, a project that 
costs very little in money or effort may be very worthwhile even if its 
effects are rather small, provided there are no obvious, superior alterna- 
tives. Any large effect is obviously important in principle, but in prac- 
tice it may be very costly, and cheaper alternative projects may have 
comparable effects. While neat cost-effectiveness conclusions are still 
beyond the state-of-the-art in educational evaluation, decisions should be 
based on the best information that can be provided. 

In addition to the size of the project effect, the conclusiveness 
of the evaluation should also be discussed. The total evaluation should 
be weighed in terms of all of the issues discussed in this guidebook, and 
factors that appear to affect the results should be noted. The hazards 
discussed in Chapter II, the model weaknesses from Chapters III and IV, 
and the data collection issues of Chapter V must all be considered. 
Further, it is the position of the authors that conclusive generalizations 
about a project are possible only after amassing consistent evidence from 
a variety of evaluations over a period of time. No single tryout can pro- 
vide a sound basis for generalizations, no matter how carefully it is con- 
ducted. 

The evaluation of projects within an educational system is an 
extremely difficult task, and decision makers need to become aware of 
the practical limitations of the process. Often, complications beyond the 
control of the evaluator preclude any definitive conclusions about project 
effectiveness, and it is the responsibility of the evaluator to reflect 
this situation accurately in the evaluation report. 

Finally, it should be noted that all of the models in this guidebook 
are directed at the question of how much better students did in the project 
than they would have done without it. Decision makers may, however, be 
interested in some other criterion, such as bringing the mean scores of 
treatment students up to the national norm. This particular criterion is 
widely encountered, and although it may represent a meaningful goal, a 
word of caution is in order. Although every evaluator will recognize that 
exactly half of all students will always be below the national average, it 
is never safe to assume that the decision maker understands this 
statistical truism. A brief discussion of the issue, including the reasonableness 
of the goal for the particular treatment students in the project, should 
always accompany any reference to such a criterion. 

Reporting 

The evaluation report is the final link in the evaluation process. 
Unless the results are adequately presented, the entire evaluation is of 
little use to anyone. A variety of people will be interested in the results 
and, ideally, a separate report should be prepared for each type of audience. 
In practice, however, only one report will be written, and it should cover 
the requirements of a wide range of readers. The recommendations below 
assume at least two basic audiences: (a) the local school board and 
administrators, and (b) educators, government officers, and school personnel 
outside of the local district. The first group will include non-specialists 
primarily interested in an easily understood description of the project 
results. The second group will include skeptical evaluation specialists 
who must be convinced that the findings are valid. To meet the needs of 
the first group, a clear summary of the project and the results should be 
provided. This summary should not be more than two or three pages long 
and should be included at the front of the report. The body of the report 
should be concise, but complete, in order to meet the needs of the critical 
evaluation specialist. It should cover the issues of objectives, costs, 
and affective changes as well as achievement gains. Report organization 
and appropriate topics other than achievement gains are discussed in detail 
in Hawkridge, Campeau, & Trickett (1970). Examples of appropriate section 
headings and formats can be found in any educational research journal. 

In presenting achievement gains, a convincing report must explain 
exactly what was done in the evaluation, provide statistics summarizing 
the results, and justify the conclusions of the evaluators. In preparing 
the description of what was done, it should be kept in mind that the 
critical reader will be concerned about all of the hazards in Chapter II 
of this guidebook and is likely to analyze the evaluation report 
systematically for possible weaknesses (as in Tallmadge & Horst, 1974). Where 
information is missing, he will probably assume the worst. Ideally, all of 
the questions raised in Chapter II and in Tallmadge and Horst (1974), as 
well as those in Chapter IV specific to a particular model, should be 
anticipated and discussed. 

At a minimum, the report should include a brief description and 
justification of the model used, a summary of the data, and the results 
of significance tests. A widespread error is the omission of summary 
statistics that are required if the results are to be meaningful. In 
particular, evaluation reports often present only mean scores as evidence 
of effectiveness. While means alone may be sufficient in a report summary, 
every mean score reported in the body of a report should be accompanied 
by the number of students represented (N), and the standard deviation of 
the distribution(s). In addition, it must always be clear whether or not 
any two means represent exactly the same group of students. Claims of sta- 
tistical significance should clearly elaborate (or reference) the exact 
test used, as well as the numerical results of the test. Discussions of 
educational importance should clearly indicate the local standards against 
which the project is compared. The local setting also bears on the extent 
to which the project might be replicable in other school districts, and 
should be spelled out as clearly as possible. 

The evaluator's final decision concerns the saving of information 
from the evaluation. The published report will provide summarized results, 
but many of the analyses and statistics recommended in this chapter will 
not be included. It is not customary, for example, to include graphs of 
score distributions in a report unless they illustrate some special point. 
Most evaluators will, however, want to keep these graphs plus all calculated 
statistics on file for future reference. Whether the raw data recording 
sheets are saved or not depends on local policy and on the possible use of 
the data in evaluations during subsequent years. Provided that the preliminary 
analyses of this chapter and the specific analyses of Chapter IV have been 
carefully completed and documented, it is unlikely that the raw data will 
be needed for future reanalyses. 

1. California Achievement Test (1970 Edition) 

A. Levels/Grades/Forms 

Level 1 / Grades 1.5-2 / Form A 
Level 2 / Grades 2-4 / Form A 
Level 3 / Grades 4-6 / Form A 
Level 4 / Grades 6-9 / Form A 
Level 5 / Grades 9-12 / Form A 

B. Normative Data Point 

February-March (All norms are projections and should not 
be used in fall-to-spring norm-referenced evaluations.) 

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study 
   Equivalency Tables) 
Grade-equivalent Scores 
Achievement Development Scale Scores (Expanded standard scores 
   which should be used for all statistical computations not 
   involving Anchor Test Study conversions.) 
Percentiles and Stanines (All percentiles and stanines 
   are projections and should not be used in norm-referenced 
   fall-to-spring evaluations.) 

D. Comments 

The reading scales of Levels 3 (Grades 4 and 5) and 4 
(Grade 6) were included in the Anchor Test Study. The 
CAT may thus be used for norm-referenced evaluations 
under the following conditions: 

1. pretest and posttest in mid-April (12-month 
   interval) using CAT spring norms; 

2. pretest and posttest in mid-April (12-month 
   interval) using Anchor Test Study Individual Score 
   Norms* for reading only, and grades 4, 5, and 6 
   only; 

3. pretest in mid-October, posttest in mid-April 
   using Anchor Test Study Equivalency Tables* and 
   Metropolitan Achievement Test norms for reading 
   only, and grades 4, 5, and 6 only. 



* The following procedure is recommended for use with Anchor Test Study 
data. First, convert each pupil's CAT raw score to the equivalent MAT 
raw score. Second, convert each MAT raw score to its corresponding 
standard score. Third, calculate all statistics using MAT standard scores. 
Then, if Anchor Test Study norms are to be used, convert the mean MAT 
standard score to its MAT raw score equivalent. The corresponding 
percentile can then be read from the Individual Score Norms Tables (not 
the School Means Norms Tables). If the MAT norms are to be used, percentile 
equivalents are provided corresponding to mean standard scores. 
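
The order of operations in this footnote can be sketched as a chain of table lookups. Every number in the three tables below is an invented placeholder used only to make the example run; real values must be read from the published Anchor Test Study and MAT tables. The table and function names are likewise hypothetical, and picking the nearest tabled standard score for the group mean is an assumption of the sketch.

```python
# Placeholder tables (NOT real norms): raw-score equivalences and norms
# would be copied from the published conversion tables.
CAT_RAW_TO_MAT_RAW = {20: 22, 25: 27, 30: 33}
MAT_RAW_TO_STANDARD = {22: 60, 27: 66, 33: 72}
MAT_RAW_TO_PERCENTILE = {22: 30, 27: 45, 33: 62}  # Individual Score Norms

def nearest_raw_for_standard(mean_standard):
    # Find the MAT raw score whose standard score is closest to the
    # group's mean standard score.
    return min(MAT_RAW_TO_STANDARD,
               key=lambda raw: abs(MAT_RAW_TO_STANDARD[raw] - mean_standard))

def group_percentile(cat_raw_scores):
    # Step 1: CAT raw score -> equivalent MAT raw score, pupil by pupil.
    mat_raw = [CAT_RAW_TO_MAT_RAW[s] for s in cat_raw_scores]
    # Step 2: MAT raw score -> MAT standard score.
    mat_standard = [MAT_RAW_TO_STANDARD[r] for r in mat_raw]
    # Step 3: statistics are computed on the standard scores.
    mean_standard = sum(mat_standard) / len(mat_standard)
    # Step 4: mean standard score -> MAT raw equivalent -> percentile.
    raw_equivalent = nearest_raw_for_standard(mean_standard)
    return mean_standard, raw_equivalent, MAT_RAW_TO_PERCENTILE[raw_equivalent]
```

The point of the ordering is that averaging happens only on standard scores; raw scores and percentiles are never averaged directly.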

2. Cooperative Primary Tests (1965 Edition) 

A. Levels/Grades/Forms 

12 / Grades 1.5-2.0 / Forms A & B 
23 / Grades 2.0-3.9 / Forms A & B 

B. Normative Data Points 

Late October-early November and late April-early May 

C. Types of Scores 

Raw Scores 
Scale Scores (Expanded standard scores which should be used 
   for all statistical computations.) 
Percentiles 

D. Comments 

This test has appropriate norms for a fall pretest-spring 
posttest norm-referenced evaluation. It was not included 
in the Anchor Test Study because it does not cover grades 
4, 5, and 6. 

3. Comprehensive Tests of Basic Skills (1968 Edition) 

A. Levels/Grades/Forms 

Level 1 / Grades 2.5-4 / Forms Q & R 
Level 2 / Grades 4-6 / Forms Q & R 
Level 3 / Grades 6-8 / Forms Q & R 
Level 4 / Grades 8-10 / Forms Q & R 

B. Normative Data Point 

Last week of February-first week of March (All norms 
are projections and should not be used in fall-to-spring 
norm-referenced evaluations.) 

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study 
   Equivalency Tables) 
Grade-equivalent Scores 
Expanded Standard Scores (should be used for all statistical 
   computations not involving Anchor Test Study conversions) 
Percentiles and Stanines (All percentiles and stanines 
   are projections and should not be used in fall-to-spring 
   norm-referenced evaluations.) 

D. Comments 

The reading scales of Level 2, Form Q (Grades 4 and 5) 
and Level 3, Form Q (Grade 6) were included in the Anchor 
Test Study. The CTBS may thus be used for norm-referenced 
evaluations under the following conditions: 

1. pretest and posttest in mid-April (12-month interval) 
   using CTBS spring norms; 

2. pretest and posttest in mid-April (12-month 
   interval) using Anchor Test Study Individual 
   Score Norms* for reading only, and grades 4, 
   5, and 6 only; 

3. pretest in mid-October, posttest in mid-April 
   using Anchor Test Study Equivalency Tables* and 
   Metropolitan Achievement Test norms for reading 
   only, and grades 4, 5, and 6 only. 



* Procedures recommended for using Anchor Test Study Equivalency Tables 
and norms with the California Achievement Test are presented in the 
footnote on page 94. The same procedures should be used with Form Q of the 
CTBS. If Form R of the CTBS is used, each raw score must be converted to 
its Form Q equivalent (using conversion tables provided by the publisher) 
before the Anchor Test Study tables are used. 






4. Gates-MacGinitie Reading Tests (1964 Edition) 

A. Levels/Grades/Forms 

Primary A  / 1.5-2.0  / 1, 1M, 2, 2M 
Primary B  / 2.0-3.0  / 1, 1M, 2, 2M 
Primary C  / 3.0-4.0  / 1, 1M, 2, 2M 
Primary CS / 2.5-4.0  / 1, 1M, 2, 2M, 3, 3M 
Survey D   / 4.0-7.0  / 1M, 2M, 3M 
Survey E   / 7.0-10.0 / 1M, 2M, 3M 

B. Normative Data Points 

October and April except January for first grade. (February 
and May norms are projections. Because of the 
proximity of the May norms to the April data point, the 
May norms are probably adequate for use with norm-referenced 
comparisons. The February norm, however, cannot 
be recommended for use with such comparisons.) 

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study 
   Equivalency Tables) 
Grade Scores 
Standard Scores (should be used for all statistical 
   computations not involving Anchor Test Study conversions) 

D. Comments 

The standard scores provided for the Gates-MacGinitie are 
not expanded standard scores. It is thus not possible to 
relate scores from one level of the test to norms for 
another level, so using test levels with appropriate 
norms may produce ceiling or floor effects when 
disadvantaged or gifted students are tested. (See Hazard 4, p. 15.) 

Survey D, Form 1M was included in the Anchor Test Study. 
The Gates-MacGinitie may thus be used for norm-referenced 
evaluations under the following conditions: 

1. pretest in mid-October, posttest in mid-May using 
   Gates-MacGinitie norms (but with the possibility 
   that ceiling and floor effects may be encountered); 

2. pretest and posttest in mid-April (12-month interval) 
   using Anchor Test Study Individual Score Norms* in 
   grades 4, 5, and 6 only; 

3. pretest in mid-October and posttest in mid-April 
   using Anchor Test Study Equivalency Tables* and 
   Metropolitan Achievement Test norms in grades 4, 
   5, and 6 only. 



* Procedures recommended for using Anchor Test Study Equivalency Tables 
and norms with the California Achievement Test are presented in the 
footnote on page 94. The same procedures should be used with Form 1M of the 
Gates-MacGinitie. The implication of using other forms is not clear as 
score equivalency tables are not provided by the publishers, despite the 
probable existence of between-form differences. (The test publishers 
apparently presume that the differences are so small as to be negligible.) 

5. Iowa Test of Basic Skills (1971 Edition) 

A. Levels/Grades/Forms 

Level 7  / 1.7-2.5 / Forms 5 & 6 
Level 8  / 2.6-3.5 / Forms 5 & 6 
Level 9  / 3.0-3.9 / Forms 5 & 6 
Level 10 / 4.0-4.9 / Forms 5 & 6 
Level 11 / 5.0-5.9 / Forms 5 & 6 
Level 12 / 6.0-6.9 / Forms 5 & 6 
Level 13 / 7.0-7.9 / Forms 5 & 6 
Level 14 / 8.0-8.9 / Forms 5 & 6 

B. Normative Data Point 

Last half of October, first half of November (Mid-year and 
spring norms are projections and should not be used for 
norm-referenced evaluations.) 

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study 
   Equivalency Tables.) 
Grade-equivalent Scores 
Age-equivalent Scores 
Standard Scores (Expanded standard scores which should be 
   used for all statistical computations not involving Anchor 
   Test Study conversions.) 
Percentiles and Stanines (Mid-year and spring scores are 
   projections and should not be used for norm-referenced 
   evaluations.) 

D. Comments 

The reading scales of Levels 10 (Grade 4), 11 (Grade 5), 
and 12 (Grade 6), Form 5, were included in the Anchor Test 
Study. The ITBS may thus be used for norm-referenced 
evaluation under the following conditions: 

1. pretest and posttest in late October-early November 
   (12-month interval) using ITBS norms; 

2. pretest and posttest in mid-April (12-month interval) 
   using Anchor Test Study Individual Score Norms* for 
   reading only, and grades 4, 5, and 6 only; 

3. pretest in mid-October and posttest in mid-April 
   using Anchor Test Study Equivalency Tables* and 
   Metropolitan Achievement Test norms for reading 
   only, and grades 4, 5, and 6 only. 



* Procedures recommended for using Anchor Test Study Equivalency Tables 
and norms with the California Achievement Test are presented on page 94. 
The same procedures should be used with Form 5 of the ITBS. The 
implications of using other forms are not clear as score equivalency tables are 
not provided, despite the fact that some between-form differences are 
present. (The test publishers apparently presume that the differences are 
so small as to be negligible.) 

6. Metropolitan Achievement Tests (1970 Edition) 

A. Levels/Grades/Forms 

Primary 1    / 1.5-2.4 / F, G, H 
Primary 2    / 2.5-3.4 / F, G, H 
Elementary   / 3.5-4.9 / F, G, H 
Intermediate / 5.0-6.9 / F, G, H 
Advanced     / 7.0-9.5 / F, G, H 

B. Normative Data Points 

Mid-October and mid-April 

C. Types of Scores 

Raw Scores 
Grade-equivalent Scores 
Standard Scores (Expanded standard scores which should be 
   used for all statistical computations.) 
Percentiles and Stanines 

D. Comments 

The reading scales of Form F of the Elementary (Grade 4) 
and Intermediate (Grades 5 and 6) Levels were included 
in the Anchor Test Study. The MAT may thus be used for 
norm-referenced evaluation under the following conditions: 

1. test in mid-October and/or mid-April (fall-to-spring 
   or 12-month interval) using MAT norms; 

2. pretest and posttest in mid-April (12-month interval) 
   using Anchor Test Study Individual Score Norms* for 
   reading only, and grades 4, 5, and 6 only. 



* If Anchor Test Study norms are to be used, convert the mean MAT standard 
score to its raw score equivalent. The corresponding percentile can then be 
read from the Individual Score Norms Table (not the School Means Norms Tables). 
If the MAT norms are to be used, percentile equivalents are provided 
corresponding to mean standard scores. 

7. Sequential Tests of Educational Progress II (1969 Edition) 

A. Levels/Grades/Forms 

4 / 4-6 / A, B 
3 / 7-9 / A, B 
2 / 10-12 / A, B 

B. Normative Data Point 

Last week in April, first three weeks in May (Fall norms 
are identical to the spring norms for the previous grade. 
As such, they should not be used in norm-referenced 
evaluations.) 

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study 
   Equivalency Tables) 
Converted Scores (Expanded standard scores which should be 
   used for all statistical computations not involving Anchor 
   Test Study conversions.) 
Percentiles and Stanines (Fall scores are projections and 
   should not be used in norm-referenced evaluations.) 

D. Comments 

The reading scales of Level 4, Form A, were included in the 
Anchor Test Study. STEP II may thus be used for 
norm-referenced evaluations under the following conditions: 

1. pretest and posttest in early May (12-month 
   interval) using STEP II norms; 

2. pretest and posttest in mid-April (12-month 
   interval) using Anchor Test Study Individual 
   Score Norms* for reading only, and grades 4, 
   5, and 6 only; 

3. pretest in mid-October, posttest in mid-April 
   using Anchor Test Study Equivalency Tables* and 
   Metropolitan Achievement Test norms for reading 
   only, and grades 4, 5, and 6 only. 



* Procedures recommended for using Anchor Test Study Equivalency Tables 
and norms with the California Achievement Test are presented in the 
footnote on page 94. The same procedures should be used with Form A of STEP II. 
If Form B is used, each raw score must be converted to its Form A equivalent 
(using conversion tables provided by the publisher) before the Anchor Test 
Study Tables are used. 

8. SRA Achievement Series (1971 Edition) 

A. Levels/Grades/Forms 

Primary I / 1.0-5.5 / E, F 
Primary II / 1.0-5.9 / E, F 
Blue / 3.5-8.5 / E, F 
Green / 4.5-9.9 / E, F 
Red / t. 5-10.5 / E, F 

B. Normative Data Point 

Mid-April (Beginning- and middle-of-year norms are 
projections and should not be used in norm-referenced evaluations.) 

C. Types of Scores 

Raw Scores (appropriate for use with Anchor Test Study 
   Equivalency Tables) 
Grade-equivalent Scores 
Growth Score Values 
Percentiles and Stanines (Beginning- and middle-of-year 
   scores are projections and should not be used in 
   norm-referenced evaluations.) 

D. Comments 

Form E of the Blue level (Grades 4 and 5) and Form E of the 
Green level (Grade 6) were included in the Anchor Test 
Study. The SRA Achievement Tests may thus be used for 
norm-referenced evaluations under the following conditions: 

1. pretest and posttest in mid-April (12-month 
   interval) using SRA Achievement norms; 

2. pretest and posttest in mid-April (12-month 
   interval) using Anchor Test Study Individual 
   Score Norms* for reading only, and grades 4, 5, 
   and 6 only; 

3. pretest in mid-October and posttest in mid-April 
   using Anchor Test Study Equivalency Tables* 
   and Metropolitan Achievement Test norms for 
   reading only, and grades 4, 5, and 6 only. 



* Procedures recommended for using Anchor Test Study Equivalency Tables 
and norms with the California Achievement Test are presented in the 
footnote on page 94. The same procedures should be used with Form E of the 
SRA Achievement Tests. The implication of using Form F is not clear as 
score equivalency tables are not provided by the publishers, despite the 
probable existence of between-form differences. (The test publishers 
apparently presume that the differences are so small as to be negligible.) 

9. Stanford Achievement Tests (1973 Edition) 

A. Levels/Grades/Forms 

Primary I       / 1.5-2.8 / A, B, C 
Primary II      / 2.5-3.8 / A, B, C 
Primary III     / 3.8-4.8 / A, B, C 
Intermediate I  / 4.8-5.8 / A, B, C 
Intermediate II / 5.8-7.8 / A, B, C 
Advanced        / 7.1-9.8 / A, B, C 

B. Normative Data Points 

October, February, and May (Most of the SAT percentile and 
stanine norms tables are closely tied to empirical data. 
The following, however, are projections and should not 
be used for norm-referenced evaluations: Primary II, 
grade 3.5; Primary III, grades 3.5 and 4.5; Intermediate 
I, grades 4.5 and 5.5; Intermediate II, grades 6.5 and 
7.5; Advanced, grades 7.5, 8.5, and 9.5.) 

C. Types of Scores 

Raw Scores 
Grade-equivalent Scores 
Scaled Scores (Expanded standard scores which should be used 
   for all statistical computations.) 
Percentiles and Stanines (Percentiles and stanines obtained 
   from the projected norms tables listed above should not 
   be used for norm-referenced evaluations.) 

D. Comments 

An earlier edition of the Stanford Achievement Tests (1964) 
was included in the Anchor Test Study. The new edition, 
however, has many advantages over the old and should be 
preferred, despite the fact that it cannot be used in 
conjunction with the Anchor Test Study Equivalency Tables. 

APPENDIX B 



Analysis of Covariance Worksheets 

Analysis of covariance is both theoretically and computationally 
complex. An evaluator undertaking this analysis should have access to 
a good reference book describing the approach in detail. Tatsuoka (1971, 
Ch. 3) and McNemar (1969, Ch. 18) provide readable explanations of the 
model. A more complete development is available in Winer (1971, Ch. 10). 
Because of the amount of computation involved, the use of a computer 
is highly desirable. Appropriate programs can be provided by most 
computer centers. 

Where the amount of data is small or computer facilities are 
unavailable, the calculations can be done by hand. This appendix provides 
a set of worksheets for simplifying the computational work. The 
worksheets are referenced directly to the numerical example in Winer (1971, 
p. 775) and preserve his notation, but are revised for the case of two 
groups (treatment plus comparison). Since the textbook example is for 
three groups, it is not directly applicable to the typical project 
evaluation. 

Four worksheets are provided: 

Worksheet One is used to record intermediate results that are used for 
the remaining calculations. All of the terms in columns one and two 
will be available from the preliminary analyses recommended in Chapter 
VI. 

Worksheet Two is used to arrive at the basic test of significance of the 
project effects. 

Worksheet Three is used to test whether the regression lines for the 
two groups have the same slope. If the F ratio for the regression lines 
is significant (i.e., the two slopes are not equal) then analysis of 
covariance should not be used, and the F ratio from Worksheet Two is 
meaningless. Logically, Worksheet Three should be completed before 
Worksheet Two. Only items (202), (205), and (208) from Worksheet Two 
are needed to complete Worksheet Three. 

Worksheet Four is used to calculate the adjusted mean posttest scores. 
These adjusted scores are used only to provide an estimate of the "real" 
effect of the project. They may be useful in determining "educational" 
significance, but are not involved in the computation of statistical 
significance. 
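
For readers with computer access, the quantities the worksheets lead to can be sketched in Python as follows. This is a minimal illustration of the standard one-covariate, two-group analysis of covariance, written under the assumption that it corresponds to what the worksheets compute; it is not a transcription of them. The function and variable names are hypothetical, and the resulting F ratios still have to be compared with tabled critical values.

```python
def ancova_two_groups(treatment, comparison):
    """Each argument is a list of (pretest, posttest) pairs for one group."""
    groups = [treatment, comparison]
    n_total = sum(len(g) for g in groups)

    # Grand sums over both groups (the double summations).
    gx = sum(x for g in groups for x, _ in g)
    gy = sum(y for g in groups for _, y in g)

    # Computational symbols patterned on (1x), (1xy), (1y), (2x), ...
    c1x, c1xy, c1y = gx * gx / n_total, gx * gy / n_total, gy * gy / n_total
    c2x = sum(x * x for g in groups for x, _ in g)
    c2xy = sum(x * y for g in groups for x, y in g)
    c2y = sum(y * y for g in groups for _, y in g)

    c3x = c3xy = c3y = 0.0
    s1 = 0.0          # residual variation about separate group regression lines
    means = []
    for g in groups:
        n = len(g)
        tx = sum(x for x, _ in g)
        ty = sum(y for _, y in g)
        c3x += tx * tx / n
        c3xy += tx * ty / n
        c3y += ty * ty / n
        exx_j = sum(x * x for x, _ in g) - tx * tx / n
        exy_j = sum(x * y for x, y in g) - tx * ty / n
        eyy_j = sum(y * y for _, y in g) - ty * ty / n
        s1 += eyy_j - exy_j ** 2 / exx_j
        means.append((tx / n, ty / n))

    # Total (S) and within-group (E) sums of squares and cross products.
    sxx, exx = c2x - c1x, c2x - c3x
    sxy, exy = c2xy - c1xy, c2xy - c3xy
    syy, eyy = c2y - c1y, c2y - c3y

    # Adjusted variation and the F ratio for the project effect (df = 1, N - 3).
    s_adj = syy - sxy ** 2 / sxx
    e_adj = eyy - exy ** 2 / exx
    f_effect = (s_adj - e_adj) * (n_total - 3) / e_adj

    # F ratio for equality of the two regression slopes (df = 1, N - 4).
    f_slopes = (e_adj - s1) * (n_total - 4) / s1

    # Adjusted posttest means, using the pooled within-group slope.
    b = exy / exx
    grand_x = gx / n_total
    adjusted = [ybar - b * (xbar - grand_x) for xbar, ybar in means]

    return {"f_effect": f_effect, "f_slopes": f_slopes,
            "adjusted_means": adjusted}
```

As the worksheet descriptions note, the slope test should be read first: if f_slopes is significant, f_effect should not be interpreted.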

Significance Levels: 

Tables of F values are available in McNemar (1969, pp. 509-511) and Winer 
(1971, pp. 864-868). In McNemar: 

   $n_1$ = degrees of freedom (df) for the numerator 
   $n_2$ = degrees of freedom (df) for the denominator 

The .05 level of significance is suggested in this guidebook. Winer 
uses the notation $(1 - \alpha) = .95$ for the .05 level of significance. 

Notation Used on the Worksheets: 

   $i$ = student number 
   $j$ = group ID (i.e., $j$ = Treatment ($t$) or Comparison ($c$)) 
   $X_{ij}$ = pretest raw score for student $i$ of group $j$ 
   $Y_{ij}$ = posttest raw score for student $i$ of group $j$ 
   $n_j$ = number of students in group $j$ 
   $N$ = total number of students ($N = n_t + n_c$) 

The remaining notation in this appendix follows Winer (1971). It may be 
helpful to note that in Winer: 

   S refers to "total" variation 
   E refers to "error" or "within group" variation 
   T refers to "treatment" or "between groups" variation 

and that on page 775 (omitting subscripts): 

   (1x)  = $(\sum\sum X)^2 / N$ 
   (1xy) = $(\sum\sum X)(\sum\sum Y) / N$ 
   (1y)  = $(\sum\sum Y)^2 / N$ 
   (2x)  = $\sum\sum X^2$ 
   (2xy) = $\sum\sum XY$ 
   (2y)  = $\sum\sum Y^2$ 

The double summation signs ($\sum\sum$) indicate that the values are first summed 
over all $n_j$ students in each group, then the two group sums are added 
together. 

On all worksheets, results which are needed for later calculations are 
identified by a three-digit number. The number (148), for example, 
indicates Worksheet One, Column 4, Row 8. Worksheets Two and Three are 
not divided into columns, so, for example, (212) indicates Worksheet Two, 
Row 12. 

There are no mathematical checks built into the worksheets. To ensure 
accuracy it is essential to have two persons complete the calculations 
independently. 

ANALYSIS OF COVARIANCE 
Worksheet Two (Winer, 1971, pp. 775-778) 

Computation of F ratio for the significance of the 
adjusted difference between the Treatment and Comparison groups 

   $S_{xx}$ = (142) - (141) = ________ (201) 
   $E_{xx}$ = (142) - (143) = ________ (202) 
   $T_{xx}$ = (201) - (202) = ________ (203) 

   $S_{xy}$ = (148) - (147) = ________ (204) 
   $E_{xy}$ = (148) - (149) = ________ (205) 
   $T_{xy}$ = (204) - (205) = ________ (206) 

   $S_{yy}$ = (145) - (144) = ________ (207) 
   $E_{yy}$ = (145) - (146) = ________ (208) 
   $T_{yy}$ = (207) - (208) = ________ (209) 

   $S'_{yy}$ = (207) - (204)$^2$/(201) = ________ (210) 
   $E'_{yy}$ = (208) - (205)$^2$/(202) = ________ (211) 
   $T_{yyR}$ = (210) - (211) = ________ (212) 

   F = (212) [(130) - 3] / (211) = ________ (213) 

degrees of freedom: 

   numerator:   1 
   denominator: $\sum (n_j - 1) - 1 = N - 3$ = ________ 